Release 2.0
Memex Technology Limited
2 Redwood Court
Peel Park
East Kilbride G74 5PF
Scotland UK
Tel: +44 (0) 1355 233 804
Fax: +44 (0) 1355 239 676
Web: http://www.memex.com
Copyright 2007 Memex Technology Limited. All rights reserved.
This manual and the software described herein are the copyright of Memex Technology Limited and may not be
copied or disclosed to a third party without the prior written permission of Memex. Whilst all possible care is taken in
the preparation of this manual, Memex assumes no responsibility or liability for any errors or inaccuracies that may
appear in this document. Memex reserves the right to make changes without notice both to this manual and to the
software and hardware it describes.
The software described in this document is furnished under licence and may only be used in accordance with the
terms of such licence.
The people, places, organisations, telephone numbers, vehicle identification numbers and other details referred to in
the sample record data in this publication are entirely fictitious. These details have been created for demonstration
purposes only and do not refer to any actual organisation, telephone number, vehicle, etc., or to any actual person,
living or dead.
The text of this document may include references to previous releases of the product, for example in screenshots
and procedural examples. Regardless of any versions that may be mentioned, this manual describes the current
functionality provided by the release of the software identified on the title page.
Trademarks
Memex, Textract and Total Content Access are registered trademarks of Memex Technology Limited. Microsoft,
PowerPoint and Windows are registered trademarks of Microsoft Corporation. Other product, brand and company
names mentioned herein are trademarks or registered trademarks of their respective owners and should be treated
as such.
2.0a-5-IJ -AC-20070912-1.6
Contents
Scope
  Related documents
  Product names
Introduction
    AutoWeb toolbar
    AutoWeb server
Chapter 1 Installing the AutoWeb server
  Server components
  Installation prerequisites
    SFU requirements
  Installing the server components
    Installing using the auto-installer
    Using the auto-installer on Windows
    Using the auto-installer on Solaris or Linux
    Installing using the tar file
    Creating extra databases
  Setting up the AutoWeb configuration file
    The default spider.cfg file
    HTTrack options and robots.txt
  Upgrading to AutoWeb 2.0
    Unpack the installation package
    Updating the configuration database
    Run the upgrade scripts
Chapter 2 Installing the AutoWeb client
  Installing the toolbar
    Configuring the toolbar
    Configuring the toolbar from the Windows registry
    How the toolbar works
  Memex Analyst forms
Chapter 3 Using AutoWeb with Memex Patriarch
  Installation tasks
    Memex Intelligence Engine
    Memex Patriarch
    AutoWeb databases for Memex Patriarch
  Configuration tasks
    Modifying the spider.cfg file
    Linking to the WebConfig database
Memex Technology Ltd A Guide to AutoWeb
    Linking to the WebArchive database
  Setting up picklists
  Adding additional web archives
Chapter 4 Using AutoWeb
  Selecting a Memex database
  Specifying keywords
  Indexing Web page text
  Indexing a Web page
  Viewing indexed pages
  Monitoring Web sites
  Specifying the sites you want to monitor
    Specifying sites - Memex Patriarch
    Specifying sites - Memex Analyst
    Fields on the configuration form
  How Web site monitoring works
  Stopping getsite.pl
    Extracting the Web page text
Appendix A Known limitations
Appendix B Troubleshooting
Appendix C HTTrack options
Appendix D Upgrading to AutoWeb 1.3
  Backing up your previous AutoWeb setup
  Installing AutoWeb 1.3
  Converting your AutoWeb data
    Setting up the conversion script
    Running the conversion script
Scope
This guide provides detailed installation and user instructions for release 2.0 of AutoWeb.
The document contains:
- An overview of the AutoWeb application
- Installation and configuration instructions for the client and server components
- Detailed user instructions
- Information on known limitations
- Instructions on how to upgrade from a previous release
If you have any comments about this guide, please contact Memex Customer Support:
support@memex.com
Related documents
For further information about this release of AutoWeb, please read the AutoWeb Release Notes.
Product names
This manual contains references to other Memex products. The names of some of these
products were changed recently for new releases of the software. The name changes are
shown in the following table.
Current name | Previous name | Notes
Memex Patriarch | Intelligence Manager | Memex Patriarch is a desktop client application, whereas Intelligence Manager comprises a desktop application plus various server components.
Memex Analyst | Intelligence Analyst |
Memex Series VI | The Intelligence Manager bundle | Memex Series VI and the Intelligence Manager bundle are sets of compatible products.
Memex Series VI Server | The Intelligence Manager server components plus the Memex Intelligence Engine | The Memex Series VI Server comprises the Memex Intelligence Engine plus various server components that support the client applications.
This manual uses the name of the current release of the software unless specifically referring
to an older release. Unless stated otherwise, details referring to a product by its current name
also apply to releases of the products that used the previous name.
Introduction
AutoWeb provides an easy way to extract text from a Web site and transfer it to a Memex
database.
AutoWeb has two main components:
- A toolbar that integrates into Internet Explorer and allows you to index individual pages
  directly from the browser.
- A server-side process that you can run either manually or as part of a cron job.
AutoWeb toolbar
When you use the AutoWeb toolbar, you can choose to index all the text from a Web page or
just index selected text. The toolbar also allows you to specify the Memex database where you
want to index the Web page, and to enter keywords associated with the page.
AutoWeb server
The server-side process reads the contents of a configuration database containing information
on which pages should be indexed. The process then mirrors (that is, stores a local copy of)
each Web page and creates a record in a Memex database. The mirrored files are used for
displaying the Web page in a browser. The database is used for retrieving a Web page based
on a search query entered in Memex Patriarch or Memex Analyst.
Whenever a page is indexed, either from the toolbar or from the server process, AutoWeb
makes a copy of the page. This enables you to access historical copies of the pages you have
indexed.
Note AutoWeb is designed to be integrated with Memex Patriarch and Memex Analyst (or
Intelligence Manager and Intelligence Analyst if you are using older versions of these
applications). You can use either application to view the configuration and index
records and access the indexed Web pages.
Chapter 1
Installing the AutoWeb server
Server components
This table lists the components that the AutoWeb server installation process installs.
Name | Details
bin/HTTrack | HTTrack is a utility that is used to mirror Web pages.
bin/libhttrack.so.1 | Shared library for HTTrack (for Solaris).
bin/lynx | Lynx is a text-based browser utility that is used to extract the text from Web pages.
bin/lynx.cfg | Configuration file for the Lynx utility.
bin/getsite.pl | This Perl script is run as a cron job. It looks at the contents of the config.db database and indexes any sites that have been set up.
bin/addtomemex.pl | This Perl script is called by any HTTrack process that is launched from getsite.pl. HTTrack calls this script every time it downloads a file. The script then decides what to do with the file and adds a record to a database if necessary.
bin/addpagefile.pl | This Perl script is called by any HTTrack process that is launched from the file IndexPage.pl. HTTrack calls this script every time it downloads a file. The script then decides what to do with the file.
cgi-bin/Bar.pl | This is a CGI script for backwards compatibility with the original Memex toolbar (Version 1.0a). This controls what appears on that version of the toolbar and the actions that the toolbar buttons perform.
cgi-bin/Databases.pl | This is a CGI script that is used by the new Memex toolbar (Version 1.0b) to determine the list of databases.
cgi-bin/IndexPage.pl | This is a CGI script that is called whenever a user selects Index Selected Text or Index Page.
config.db | The database that contains information on what sites getsite.pl should index.
databases | This directory contains all the databases where the indexed pages are stored.
dbconfigs | This directory contains the database configs.
images/memexbar.bmp | This bitmap is an image list for the toolbar.
install | The install script for the server installation.
mirror | This directory contains the mirrored Web pages.
spider.cfg | This is the config file for AutoWeb.
locales/EN.loc | English locale file.
perlmodules/Config/General.pm | Required Perl module.
perlmodules/Config/General/Extended.pm | Required Perl module.
perlmodules/Config/General/Interpolated.pm | Required Perl module.
perlmodules/File/Basename.pm | Required Perl module.
perlmodules/File/CheckTree.pm | Required Perl module.
perlmodules/File/Compare.pm | Required Perl module.
perlmodules/File/Copy.pm | Required Perl module.
perlmodules/File/DosGlob.pm | Required Perl module.
perlmodules/File/Find.pm | Required Perl module.
perlmodules/File/Path.pm | Required Perl module.
perlmodules/File/Spec.pm | Required Perl module.
perlmodules/File/stat.pm | Required Perl module.
perlmodules/File/Spec/Functions.pm | Required Perl module.
perlmodules/File/Spec/Mac.pm | Required Perl module.
perlmodules/File/Spec/OS2.pm | Required Perl module.
perlmodules/File/Spec/Unix.pm | Required Perl module.
perlmodules/File/Spec/VMS.pm | Required Perl module.
perlmodules/File/Spec/Win32.pm | Required Perl module.
Installation prerequisites
Before you can install the AutoWeb server, your system must contain:
- One of the following operating systems:
  - Sun Solaris 10
  - Red Hat Enterprise Linux 4
  - Microsoft Windows Services for UNIX 3.5
- Perl 5.0 or greater
- Memex Intelligence Engine (MIE) 6.0
- Apache 2 HTTP server.
  Apache must be configured to run as the Memex administrator user.
  To configure Apache 2 to run as the Memex administrator user:
  - Change to the directory where Apache's httpd.conf file is located. For example:
        cd /usr/local/apache2/conf
  - Edit the httpd.conf file with a plain text editor, such as vi.
  - Locate the section of the configuration file that specifies the user as whom the httpd
    service will run. For example, to force Apache 2 to run as the user mxadmin in the
    group mxadmins, add or change the User and Group lines:
        User mxadmin
        Group mxadmins
  Apache's log files must be writable by the Memex administrator user (typically mxadmin
  or mxroot).
  To do this on Solaris or Linux:
  - su as root
  - Change the ownership of the directory where Apache's log files reside. The location
    of the log files is specified in Apache's httpd.conf file. The directory and its contents
    should be owned by the Memex administrator user. For example:
        chown -R mxadmin:mxadmins /var/apache2/logs
  To do this on Windows SFU:
  - From an SFU command console, su as Administrator.
  - Change the ownership of the directory where Apache's log files reside. The location
    of the log files is specified in Apache's httpd.conf file. The directory and its contents
    should be owned by the Memex administrator user. For example:
        chown -R SERVERNAME+mxadmin:SERVERNAME+mxadmins /usr/local/apache2/logs
To run the getsite.pl script as a cron job (see Monitoring Web sites on page 33), the Memex
administrator account (usually mxadmin or mxroot) must have a home directory.
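The Apache prerequisite above depends on the User and Group directives in httpd.conf. As a rough illustration only (this script is not part of AutoWeb), the following Python sketch reads configuration text and reports which user and group the last User and Group directives name. The sample text is a made-up excerpt, and the helper name apache_run_as is an assumption introduced for this example.

```python
import re

def apache_run_as(conf_text):
    """Return (user, group) from the last User and Group directives found."""
    user = group = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Match "User <name>" or "Group <name>", but not e.g. "UserDir".
        match = re.match(r"(?i)(User|Group)\s+(\S+)", line)
        if match:
            if match.group(1).lower() == "user":
                user = match.group(2)
            else:
                group = match.group(2)
    return user, group

# Hypothetical httpd.conf excerpt, for illustration only.
sample = """\
ServerRoot "/usr/local/apache2"
# Run the httpd service as the Memex administrator
User mxadmin
Group mxadmins
"""

print(apache_run_as(sample))
```

If the reported pair is not the Memex administrator user and group, the User and Group lines need to be changed as described above.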
SFU requirements
If you are installing on SFU, you must first install the following software packages:
Package name | Description
httpd | Apache 2 HTTP Server
lynx | Lynx Web browser for terminals
zlib | Zlib data compression library
These packages are available from the SFU Tools Warehouse Web site:
http://www.interopsystems.com/tools/warehouse.htm
To install these packages, first download and install the package installer that is available as a
shell script from the same Web site. You can then issue simple commands from a shell
console window that use the package installer to download and install the software packages
and all their dependencies. For example, to install Apache 2, run the command:
    pkg_update -L httpd
For more information, see the SFU Tools Warehouse Web site.
Installing the server components
The method for installing the AutoWeb server components varies depending on whether your
MIE was installed as part of a Memex Series VI Server installation. If you are adding
AutoWeb to a Memex Series VI system, use the auto-installer method described here.
Otherwise, use the tar file method on page 12.
Installing using the auto-installer
The auto-installer is available for Windows, Linux and Solaris. You must have a Memex
Series VI Server setup to be able to use the AutoWeb auto-installer.
Using the auto-installer on Windows
1. Locate the autoweb_windows.exe file in Windows Explorer.
2. Right-click this file and choose Run As.
3. Select The following user and enter <COMPUTERNAME>\Administrator.
4. Enter the password for Administrator and click OK.
5. Follow the setup instructions on screen:
   - Memex recommends leaving the destination directory as C:\SFU\opt\memex
   - In most cases you can leave the hostname and port settings at their default values:
     Hostname: localhost
     Port: 9001
   - Enter the name and password of an MIE superuser. To check the names of current MIE
     superusers, look at the values of the superusers element in the memexsvr.xml file
     (usually located in /opt/memex/etc).
6. As instructed at the end of the auto-installation process, add an Include statement to
   Apache's httpd.conf file.
   For example, from an SFU shell, run the command:
       echo "Include /opt/memex/autoweb/config/apache2.conf" >> /usr/local/apache2/conf/httpd.conf
7. Start, or restart, the Apache Web server:
       /usr/local/apache2/bin/apachectl restart
Creating extra databases
One sample database is created as part of the installation process. The sample database is
called webarchive. The directory for AutoWeb databases is: /opt/memex/autoweb/databases.
You can create extra databases by using the ns_create command followed by the mkphonetic
command. For example:
    ns_create -c /opt/memex/autoweb/dbconfigs/config.archive \
        -n 8192 /opt/memex/autoweb/databases/mynewdb
    mkphonetic /opt/memex/autoweb/databases/mynewdb
See the Memex Intelligence Engine Administrator's Guide for more information on the ns_create
and mkphonetic utilities.
Setting up the AutoWeb configuration file
spider.cfg is the configuration file for AutoWeb. This table lists the entries that the
configuration file must contain. The default spider.cfg file is shown on page 17.
Name Details
installpath
    The installation directory of the AutoWeb server. This is set automatically
    by the install script.
    For example: /opt/memex/autoweb
locale
    The language locale to use for the server responses to the Memex toolbar.
    This must be set to match one of the files in the locales directory in the
    installation path.
    For example: EN
mirrorurl
    The URL for the mirror directory. This must contain the full domain name
    and the alias that you gave for the mirror directory.
    For example: http://server.domain.com/autoweb-mirror
httracklib
    The path to the lib file for HTTrack.
    For example: /opt/memex/autoweb/bin
httrack
    The path to the HTTrack executable.
    For example: /opt/memex/autoweb/bin/httrack
opts
    The options that getsite.pl uses to call HTTrack (see HTTrack options and
    robots.txt on page 17).
    For example: -n -%e0
stdopts
    More options that getsite.pl uses to call HTTrack.
    For example: -I0 -Qq --assume cfm=text/html,php=text/html -X0 -%F ""
append
    The path to the ns_append utility.
    For example: /opt/memex/mie/bin/ns_append
decode
    The path to the decode utility.
    For example: /opt/memex/mie/bin/decode
configdb
    The path to the config database for getsite.pl.
    For example: /opt/memex/autoweb/config.db
lynx
    The path to the lynx utility and the parameters that must be passed.
    For example: /opt/memex/autoweb/bin/lynx -cfg="/opt/memex/autoweb/bin/lynx.cfg"
domain
    The web server domain.
    For example: server.domain.com
imglst
    The path that will be added to the domain to retrieve the image list for the
    toolbar. The first part of this must be the name that you gave to the alias
    for the /images directory.
    For example: /autoweb-images/memexbar.bmp
cgi-bin
    The path that will be added to the domain to access the cgi-bin for
    AutoWeb. This must be the name of the alias that you gave for the cgi-bin
    directory.
    For example: /autoweb-bin/
pageopts
    The options used in the call from IndexPage.pl to HTTrack.
    For example: -%P0 -C0 -I0 -%Q -n -Qq -d --assume cfm=text/html,php=text/html -X0 -%F ""
logfile
    The location of the log file for AutoWeb. If this entry does not exist, no log
    file is created.
    For example: /opt/memex/logs/crawlerlog.txt
filtertypes
    A list of the file types that AutoWeb will not write a record for.
    For example: ra|ram|jpg|gif|pbm|mov|avi|wmv|css|pdf|ps|js|xml|rdf
lockfile
    The lock file that is used to prevent getsite.pl from running more than
    once.
    For example: /tmp/autoweblock
notrenamed
    A list of the file types that HTTrack does not rename as html.
    For example: html|htm|txt
imbase
    The installation directory of the Memex Patriarch software on the server.
    This entry is optional and is only necessary if you want to use AutoWeb
    from within Memex Patriarch.
    This parameter should usually be set to: /opt/memex/im
rollover
    The number of days before the mirror directory is rolled over.
    Rolling over the mirror directory involves creating a new subdirectory in
    the location specified by the mirrorurl setting. If you leave this at the
    default of 7, a new mirror subdirectory is created every 7 days for storing
    Web pages (2007001, 2007002 and so on).
    To turn off this process, set the value to 0, although this is not
    recommended. The default and recommended value in the provided file
    is 7.
Note You use different configuration file variables to specify the HTTrack options,
depending on how you are running AutoWeb:
- If you are running the AutoWeb toolbar, use the pageopts variable to specify the
  HTTrack options.
- If you are running AutoWeb as a cron job via getsite.pl, use the stdopts variable
  to specify the HTTrack options.
The default spider.cfg file
#Config file for Intelligence Mirror
installpath /opt/memex/autoweb
mirrorurl http://localhost/autoweb-mirror
httracklib /opt/memex/autoweb/bin
httrack /opt/memex/autoweb/bin/httrack
opts -n -%e0 -A32000
stdopts -I0 -Qq --assume cfm=text/html,php=text/html -X0 -%F ""
append /opt/memex/mie/bin/ns_append
decode /opt/memex/mie/bin/decode
configdb /opt/memex/autoweb/config.db
lynx /opt/memex/autoweb/bin/lynx -cfg="/opt/memex/autoweb/bin/lynx.cfg"
domain localhost
imglst /autoweb-images/memexbar.bmp
cgi-bin /autoweb-bin/
pageopts -%P0 -C0 -I0 -%Q -n -Qq -d --assume cfm=text/html,php=text/html -X0 -%F ""
logfile /opt/memex/autoweb/crawlerlog.txt
filtertypes ra|ram|jpg|gif|pbm|mov|avi|wmv|css|pdf|ps|js|xml|rdf
lockfile /tmp/spiderlock
notrenamed html|htm|txt
locale EN
imbase /opt/memex/im
rollover 7
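To make the layout of this file concrete, here is a minimal Python sketch that reads simple "name value" lines like those above into a dictionary. It is an illustration only: the AutoWeb scripts themselves are Perl and ship with the Config::General module, whose full syntax this sketch does not attempt to cover. The function names parse_spider_cfg and is_filtered are hypothetical, and the way is_filtered matches file extensions against filtertypes (case-insensitive, on the final extension) is an assumption, not documented AutoWeb behaviour.

```python
def parse_spider_cfg(text):
    """Read 'name value' lines; ignore blank lines and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # The name is the first word; the rest of the line is the value.
        parts = line.split(None, 1)
        settings[parts[0]] = parts[1] if len(parts) > 1 else ""
    return settings

def is_filtered(filename, settings):
    """Assumed check: is the file's extension listed in filtertypes?"""
    types = settings.get("filtertypes", "")
    if not types:
        return False
    extensions = {t.lower() for t in types.split("|")}
    _, dot, ext = filename.rpartition(".")
    return bool(dot) and ext.lower() in extensions

# Shortened sample based on the default file shown above.
sample_cfg = """\
#Config file for Intelligence Mirror
installpath /opt/memex/autoweb
opts -n -%e0 -A32000
filtertypes ra|ram|jpg|gif
rollover 7
"""

cfg = parse_spider_cfg(sample_cfg)
print(cfg["installpath"], cfg["opts"], is_filtered("logo.gif", cfg))
```

Splitting on the first run of whitespace keeps multi-word values such as the HTTrack option strings intact.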
HTTrack options and robots.txt
A robots.txt file is stored in the root of most Web servers. This file alerts crawlers and web
spiders, such as AutoWeb, as to which pages they should ignore when retrieving pages from
the remote Web server.
The original specification of this standard and the IETF draft are available from the following
sites:
http://www.robotstxt.org/wc/norobots.html
http://www.robotstxt.org/wc/norobots-rfc.html
Because robots.txt restricts the files that can be downloaded by web spiders, it has an impact
on the AutoWeb server software and its ability to track and store Web pages.
AutoWeb uses HTTrack software to retrieve remote Web pages. If required, you can
configure HTTrack to either follow or ignore the directives in the robots.txt file. You do this
by changing the opts setting in the spider.cfg file. For more information, see Appendix C
HTTrack options on page 41.
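To show what following the robots.txt directives means in practice, the sketch below uses Python's standard urllib.robotparser module to evaluate a small robots.txt file. This is not how AutoWeb or HTTrack implement the check; it only illustrates the effect of a Disallow rule. The robots.txt content, the spider name ExampleSpider and the example.com URLs are all made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every crawler is asked to stay out of /private/.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A spider that honours robots.txt may fetch the first URL but not the second.
allowed = parser.can_fetch("ExampleSpider", "http://www.example.com/index.html")
blocked = parser.can_fetch("ExampleSpider", "http://www.example.com/private/report.html")
print(allowed, blocked)
```

A spider configured to ignore robots.txt would retrieve both pages, which is why the setting affects which pages AutoWeb can track and store.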
Chapter 2
Installing the AutoWeb client
Installing the toolbar
To install the AutoWeb toolbar:
1. In Windows Explorer, browse to the location of the supplied AutoWeb.exe file for the
   client application.
2. Double-click AutoWeb.exe.
   This launches the AutoWeb InstallShield program.
3. Click Yes to accept the license agreement.
   This displays the Choose Destination Location page.
4. Browse to the location where you want to install the files, and click Next.
   The InstallShield program installs the AutoWeb files and displays a confirmation
   message when the installation is complete.
5. Click Finish to acknowledge the message.
Configuring the toolbar
After installing the AutoWeb toolbar, you need to open Internet Explorer and make sure that
the toolbar is now available.
If the toolbar is not visible, choose View > Toolbars > AutoWeb. This adds the AutoWeb
toolbar to Internet Explorer.
The toolbar should look like this:
[Screenshot: the AutoWeb toolbar in Internet Explorer]
To configure the AutoWeb toolbar:
1. Click the arrow beside the AutoWeb button and choose Configuration from the drop-down
   list.
   This displays the Configuration dialog box.
2. Enter the URL of the cgi-bin directory on the web server where the AutoWeb server
   software is installed. Typically, this is: http://server.domain/autoweb-bin/
   For example: http://achilles.memex.com/autoweb-bin/
   You can check this value by looking for the relevant ScriptAlias entry in Apache's
   httpd.conf file (or in the /opt/memex/autoweb/config/apache2.conf file for an
   installation with Memex Series VI Server).
3. Click OK.
   This enables the AutoWeb toolbar. All the toolbar options will now be available.
Configuring the toolbar from the Windows registry
If you are installing the AutoWeb toolbar on a significant number of machines, or if you want
to restrict user access to the Configuration option, you can configure the toolbar via a specific
registry file, autoweb.reg. This file is supplied by Memex alongside the client installation
file.
You specify the following settings in the autoweb.reg file:
URL
    The full URL of the cgi-bin directory on the web server where the AutoWeb server
    software is installed.
ConfigDisabled
    A DWORD value in the registry. Set this to 1 (or any non-zero value) to disable the
    AutoWeb toolbar's Configuration menu option.
For example, a typical autoweb.reg file looks like this:
REGEDIT4
[HKEY_LOCAL_MACHINE\SOFTWARE\Memex Technology Ltd\AutoWeb]
"URL"="http://server.domain/autoweb-bin/"
"ConfigDisabled"=dword:00000000
Double-click this file to apply the changes to the Windows registry of the local computer.
Note These settings apply to all user accounts on the computer. The changes are applied
to Internet Explorer the next time it is started.
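If you need to prepare registry files for several servers, the file can be generated rather than edited by hand. The following Python sketch builds text in the same shape as the autoweb.reg example above. It is an illustration only and not a Memex tool: the function name build_autoweb_reg is hypothetical, and the blank line after the REGEDIT4 header and the CRLF line endings are assumptions based on the usual layout of registry files.

```python
def build_autoweb_reg(url, config_disabled=False):
    """Return the text of a registry file that sets URL and ConfigDisabled."""
    # Backslashes and quotes inside a registry string value must be escaped.
    escaped_url = url.replace("\\", "\\\\").replace('"', '\\"')
    lines = [
        "REGEDIT4",
        "",  # assumed: blank line between the header and the first key
        r"[HKEY_LOCAL_MACHINE\SOFTWARE\Memex Technology Ltd\AutoWeb]",
        '"URL"="%s"' % escaped_url,
        # DWORD values are written as eight hexadecimal digits.
        '"ConfigDisabled"=dword:%08x' % (1 if config_disabled else 0),
    ]
    return "\r\n".join(lines) + "\r\n"

reg_text = build_autoweb_reg("http://server.domain/autoweb-bin/", config_disabled=True)
print(reg_text)
```

Passing config_disabled=True produces the value that hides the toolbar's Configuration menu option, as described above.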
To add a further level of security, you can place security permissions on these registry keys to
prevent them being changed. This stops users from reconfiguring the toolbar themselves. For
more information on setting permissions for registry keys, see your Microsoft Windows
documentation.
How the toolbar works
Implementation
The AutoWeb toolbar is implemented as a native DeskBand component for Internet Explorer
using Visual C++. This requires the MXAutoWeb.dll file to be registered on each client
machine. After the library is registered, users can display the toolbar by accessing Internet
Explorer and selecting View > Toolbars > Memex AutoWeb Toolbar.
Configuration
The toolbar configuration is controlled by the following registry key:
HKEY_LOCAL_MACHINE\SOFTWARE\Memex Technology Ltd\AutoWeb
This key holds the string value URL, which contains the base URL of the cgi-bin
directory on the web server containing the CGI scripts.
Processing index requests
When a user clicks Index Page or Index Selected Text on the toolbar, AutoWeb sends an
HTTP request to the IndexPage.pl Perl CGI script, located within the cgi-bin directory on the
server.
This request contains the following parameters:
- The Memex database where the indexed text will be stored
- The keywords to add to the database record
- The (selected) text from the page
- An indication as to whether the user is indexing the entire page or just selected text
- The Web page's URL
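The five parameters listed above can be pictured as an ordinary form-encoded request body. The Python sketch below is purely illustrative: this guide does not document the actual parameter names or the HTTP method the toolbar uses, so the names database, keywords, text, mode and url, and the function build_index_request, are all hypothetical.

```python
from urllib.parse import urlencode, parse_qs

def build_index_request(database, keywords, text, whole_page, page_url):
    """Encode the five pieces of information as a form-style query string."""
    params = {
        "database": database,   # hypothetical name: target Memex database
        "keywords": keywords,   # hypothetical name: keywords for the record
        "text": text,           # hypothetical name: page text or selection
        "mode": "page" if whole_page else "selection",  # hypothetical flag
        "url": page_url,        # hypothetical name: the Web page's URL
    }
    return urlencode(params)

body = build_index_request(
    "webarchive", "sample keywords", "Some selected text",
    False, "http://www.example.com/news.html",
)
# Decoding the string again shows that each value survives the encoding.
fields = parse_qs(body)
print(fields["mode"], fields["url"])
```

Encoding the values matters because the page text and URL can contain spaces, ampersands and other characters that are not safe in a raw request.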
IndexPage.pl then calls HTTrack for the URL (this call is run in the background). HTTrack
attempts to create a mirror of that page.
This call to HTTrack contains a parameter specifying whether each indexed file will contain a
timestamp in the filename. HTTrack in turn calls addpagefile.pl, which compares the new
indexed file with the most recent version on the local server.
- If the files are the same, the new version is deleted and replaced with a symbolic link to
  the most recent file.
- If the files are different, the new file becomes the most recent version and is used for
  any subsequent comparisons.
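The compare-and-link behaviour described above can be sketched in a few lines. The real logic lives in the Perl script addpagefile.pl, which is not reproduced in this guide, so the Python below is only an approximation of the described behaviour under stated assumptions: the function name keep_or_link is hypothetical, the comparison is a byte-for-byte file comparison, and the platform is assumed to support symbolic links.

```python
import filecmp
import os

def keep_or_link(new_path, latest_path):
    """Apply the rule described above to a newly mirrored file.

    Returns the path that should be treated as the most recent version
    for subsequent comparisons.
    """
    same = os.path.exists(latest_path) and filecmp.cmp(
        new_path, latest_path, shallow=False  # compare contents, not just metadata
    )
    if same:
        # Identical content: drop the new copy and point to the existing file.
        os.remove(new_path)
        os.symlink(os.path.abspath(latest_path), new_path)
        return latest_path
    # Different content: the new file becomes the most recent version.
    return new_path
```

Replacing unchanged copies with links keeps a dated entry for every indexing run while storing the page content only once.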
After completing the call to HTTrack, IndexPage.pl writes a record into the specified
database containing:
- The original URL
- The URL of the mirror
- The keywords
- The (selected) text from the page
- The date and time
Note All the responses that IndexPage.pl returns to the user come from the selected
locale file within the locales directory. If no locale is set, the default English locale
(stored in EN.loc) is used.
Memex Analyst forms
Two new Memex Analyst forms are installed as part of the AutoWeb client installation:
WebArchive.mfm
    Use this form to view any records in databases that store information on indexed web
    pages.
Control.mfm
    Use this form to view the records in the config database on the server.
To use these forms:
1. Go to the Properties dialog box for the data source.
2. In the Form section, choose the Use form radio button.
3. Click the button to browse to the location of the form files on the local computer,
   typically:
   C:\Program Files\Memex Technology Ltd\AutoWeb\
4. Select the file and click Open.
   The screenshot below gives an indication of how the Properties dialog should look once
   you have selected your form.
   [Screenshot: Properties dialog box with the form selected]
5. Click OK in the Properties dialog box.
Chapter 3
Using AutoWeb with Memex Patriarch
Note This chapter contains information on configuring AutoWeb to be used with Memex
Patriarch on a Memex system that was manually installed. If your system is a
Memex Series VI Server that was installed using the provided auto-installer (i.e. you
use Memex Patriarch to administer your system), you can skip this chapter and
continue reading Chapter 4
Using AutoWeb on page 31.
AutoWeb is designed to integrate with Memex Patriarch and Memex Analyst. However, you
must perform some extra installation and setup tasks to use AutoWeb within Memex
Patriarch.
Important You can use either Memex Patriarch or Memex Analyst for choosing the Web
sites you want AutoWeb to monitor. However, you cannot configure AutoWeb
from both applications. The steps described in this section enable configuration
from within Memex Patriarch. This will disable configuration from within Memex
Analyst. You will still be able to view the configuration records in Memex
Analyst, but you will only be able to add or edit configuration records from
Memex Patriarch.
Installation tasks
Memex Intelligence Engine
For Memex Patriarch and AutoWeb to work together, MIE 6.0 must be installed on all the
servers that will be used to host both Memex Patriarch and AutoWeb.
Notes You do not need to place Memex Patriarch and AutoWeb on completely
separate physical machines. A single MIE instance can host both the Memex
Patriarch and AutoWeb databases.
If your system uses multiple physical servers, all the physical machines must
share the same secret file to allow for certificate authentication.
For details on how to set up the MIE on your servers, read the MIE 6.0 Installation Guide.
Memex Patriarch
TheMemexPatriarchserversidecomponentscanbeinstalledintwoways:
1. UsingtheMemexSeriesVIServerautoinstaller
2. UsingthePerlbasedinstaller
Memex Technology Ltd A Guide to AutoWeb
The Perl-based installer provides a way to specify many of the configuration options during
the installation process, whereas the auto-installer provides a quick way to install a pre-built
installation.
This section relates to Memex server installations done using the Perl-based installer. This
installer will also be used to install the two AutoWeb databases for Memex Patriarch.
For more information on the Perl-based installer, see the Memex Series VI Server Installation
Guide: Part II Patriarch Components.
AutoWeb databases for Memex Patriarch
AutoWeb contains two Memex database definitions that you can use to install AutoWeb
databases for Memex Patriarch. These databases allow you to search and control AutoWeb
from inside Memex Patriarch.
To enable these database definitions, copy the im13autoweb directory into the im-install
directory (which was created when the Perl-based installer was used to install the Memex
Patriarch server components). For example:
cp -R /opt/memex/autoweb/im13autoweb /opt/memex/im/im-2.0a-105-vanilla-interix/im-install
Important If you deleted the im-install directory after installing the Memex Series VI
Server, you will no longer have the Perl-based installer. You need this to
proceed with this installation procedure. Contact Memex Customer Services
and request a copy of the tar file containing the Perl-based installer for the
Memex Patriarch server components.
The installer for the Memex Patriarch server components must be run on the
physical machine that hosts the Memex configuration server. If AutoWeb is
installed on a machine that is not the configuration server, you must copy
the AutoWeb database definitions to the configuration server, by transferring
the im13autoweb directory across the network to the physical machine that
is hosting the configuration server.
Before you can install the AutoWeb database definitions, you need the following information
about your Memex Series VI Server setup:
• The host name and port number for the Memex Intelligence Engine that you will use to
access the AutoWeb databases
• The prefix and name of the logical server that will host the AutoWeb databases
You will add this information to the installer's setup.xml file to specify where the AutoWeb
databases will be created.
Editing the setup.xml file
When you have copied the im13autoweb directory to the im-install directory, you must
modify the setup.xml file within the im-install/im13autoweb directory. This file contains the
database definitions for the two new AutoWeb databases: WebConfig and WebArchive. It
also defines a new logical server named AutoWeb (prefix AW).
If you want to create the AutoWeb databases on a remote server, you must edit the attributes
for the host element, specifying the server where the new AutoWeb databases will be
created. To do this, change the attributes to: hostname="hostname" port="number".
For example: <host hostname="cutlass" port="9001">
Alternatively, to create the AutoWeb databases on the same physical machine as the Memex
Patriarch configuration server, leave the host element as:
<host local="y">
Installing the AutoWeb databases
After editing the setup.xml file, you must run the Perl-based installer for Memex Patriarch
to install the new AutoWeb databases and logical server.
To install the AutoWeb databases and server:
1. Change to the im-install directory on the configuration server. For example:
cd /opt/memex/im/im-2.0a-105-vanilla-interix/im-install
2. Run the following command:
perl install.pl -c <CS_Prefix> -i <Patriarch_Install> -m <MIE_Install> -x <MIE_Config> -p <Local_MIE_Port> -f autoweb/im13autoweb
Where:
• <CS_Prefix> is the prefix of the logical server used as the configuration server (usually
CS).
• <Patriarch_Install> is the directory where the Memex Patriarch server-side
components are installed (usually /opt/memex/im).
• <MIE_Install> is the directory where the MIE is installed (usually /opt/memex/mie).
• <MIE_Config> is the path to the MIE configuration file (usually
/opt/memex/etc/memexsvr.xml).
• <Local_MIE_Port> is the TCP port on which the local MIE listens for connections.
For example:
perl install.pl -c CS -i /opt/memex/im -m /opt/memex/mie -x /opt/memex/etc/memexsvr.xml -p 9001 -f autoweb/im13autoweb
3. When the details of the installation are displayed, enter y to confirm that you want to
continue with the installation.
4. Enter the user name and password of the Memex Patriarch superuser.
The script completes the installation.
Configuration tasks
To configure AutoWeb to work with Memex Patriarch, you must update AutoWeb to use the
new entities that have been created.
Modifying the spider.cfg file
This task is mandatory if you want to use AutoWeb with Memex Patriarch.
The spider.cfg file is located in the autoweb directory. The file contains the setting imbase.
You must edit this setting to point to the directory where Memex Patriarch is installed.
For example, if Memex Patriarch is installed in /opt/memex/im, you would change the
spider.cfg file setting to:
imbase /opt/memex/im
Important You must modify spider.cfg before you make any of the other changes
described in this section. If you do not make this change, AutoWeb will not be
able to detect that it is inserting data into a Memex Patriarch database, and
the resulting records will be inaccessible from the client software.
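Before moving on, it can help to read the setting back from the file. The following is a minimal sketch, assuming that spider.cfg stores each setting as a name followed by whitespace and a single value, as in the imbase example above. The function name get_setting and the sample file are hypothetical and used for illustration only.

```shell
# Print the value of a named setting from a spider.cfg-style file.
# Assumption: one "name value" pair per line, separated by whitespace,
# and the value itself contains no spaces.
get_setting() {
    # $1 = path to the configuration file, $2 = setting name
    awk -v key="$2" '$1 == key { print $2; exit }' "$1"
}

# Demonstration against a throwaway sample file (not a real spider.cfg).
sample_cfg=$(mktemp)
printf 'imbase /opt/memex/im\n' > "$sample_cfg"
imbase_value=$(get_setting "$sample_cfg" imbase)
echo "imbase is set to: $imbase_value"
rm -f "$sample_cfg"
```

On a real system you would pass the path of your own spider.cfg file instead of the sample file, and confirm that the directory printed is the one where Memex Patriarch is installed.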
2. Move the config.db aside by entering the following command:
mv config.db config.db.old
3. Create a link to the WebConfig database by entering the following command:
ln -s <path_to_im>/<autoweb_database_prefix>/databases/WebConfig config.db
For example:
ln -s /opt/memex/im/AW/databases/WebConfig config.db
AutoWeb will now use the WebConfig database rather than the config.db database.
Note If you are upgrading your AutoWeb setup from a previous version, you must make
sure that a uniq_id file is stored in the WebConfig database's directory. You can do
this manually, or by adding a record to the database in Memex Patriarch.
For more information, consult the MIE Administrator's Guide.
2. Create a link to the WebArchive database by entering the following command:
ln -s <path_to_im>/<autoweb_database_prefix>/databases/WebArchive <name_of_archive_database>
For example:
ln -s /opt/memex/im/AW/databases/WebArchive webarchive
Notes
The AutoWeb toolbar will list the WebArchive database by the name of the
symbolic link (usually webarchive).
If you are upgrading your AutoWeb setup from a previous version, you must
make sure that a uniq_id file is stored in the WebArchive database's directory.
You can do this manually, or by adding a record to the database in Memex
Patriarch.
For more information on the uniq_id file, see the Memex Intelligence Engine
Administrator's Guide.
You will be able to use Memex Patriarch to view Web pages indexed from the AutoWeb
toolbar by searching the WebArchive database on the AutoWeb logical server within Memex
Patriarch.
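The move and link commands in this section can be rehearsed safely before you touch a live installation. The following is a minimal sketch that runs entirely inside a scratch directory. The directory layout, the placeholder config.db file and the variable names are assumptions made for illustration, so substitute the paths of your own installation when you perform the real steps.

```shell
# Rehearse the move-and-link steps in a scratch directory so that nothing real is touched.
work_dir=$(mktemp -d)
target_dir="$work_dir/im/AW/databases"   # stand-in for <path_to_im>/<autoweb_database_prefix>/databases
mkdir -p "$target_dir/WebConfig" "$target_dir/WebArchive" "$work_dir/autoweb"
cd "$work_dir/autoweb" || exit 1
touch config.db                          # stand-in for the existing config.db

# Move the old database aside, then link the Memex Patriarch databases in its place.
mv config.db config.db.old
ln -s "$target_dir/WebConfig" config.db
ln -s "$target_dir/WebArchive" webarchive

# Confirm that each name is now a symbolic link that resolves to a directory.
link_status=ok
for name in config.db webarchive; do
    if [ -L "$name" ] && [ -d "$name" ]; then
        echo "$name: link resolves"
    else
        echo "$name: missing or broken link"
        link_status=broken
    fi
done

# Remove the scratch directory.
cd / && rm -rf "$work_dir"
```

The same -L and -d checks can be run against the real links after you create them, which catches a mistyped target path before AutoWeb tries to use the database.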
Setting up picklists
This is an optional task.
In Memex Patriarch, the WebConfig entity contains a single picklist field, Database, which
holds a list of all the databases in AutoWeb. This list is not automatically populated. You
should update this list whenever you add a database to AutoWeb.
For information on modifying picklists in Memex Patriarch, refer to the Memex Patriarch
Online Help.
Chapter 4
Using AutoWeb
AutoWeb is a utility that allows you to easily add the text of a Web page to a Memex
database. In addition to this, when you extract text from a Web page, AutoWeb creates a
mirror of the Web page on a local server. You can then use Memex Analyst to view the
records created from the Web page text and to view the mirrored copy of the Web page.
Selecting a Memex database
To specify where the Web page text will be stored, choose a database from the Select
Database drop-down list.
Specifying keywords
To associate keywords with an indexed Web page, type the keywords into the Enter
Keywords text box.
Indexing Web page text
To extract specific text from a Web page, highlight the text and then click the Index Selected
Text button.
When you click this button, AutoWeb also mirrors the entire Web page to the local server.
Indexing a Web page
To extract the text of an entire Web page, click the Index Page button.
When you click this button, AutoWeb also mirrors the entire Web page to the local server.
Viewing indexed pages
You can use Memex Patriarch or Memex Analyst to retrieve the indexed records.
The indexed record for each Web page contains:
• The URL of the original page
• The URL of the mirrored copy of the page
• The date and time that the page was indexed
• The text (or the selected text) from the page
• The keywords that are associated with the page
If Memex Analyst has been set up to use the forms distributed with the AutoWeb toolbar, the
result form displays the mirrored copy of the page when you view one of the records. The
screenshot below shows an example of this.
Monitoring Web sites
The getsite.pl script indexes all the sites that are listed as records in the AutoWeb
configuration database. This database is called WebConfig within the AutoWeb logical server
on a Memex Series VI installation, or comprises the file /opt/memex/autoweb/config.db on
installations completed using the tar file method.
After you have specified the sites you want to index, you can run getsite.pl manually, or you
can configure it to run on a regular basis as a cron job. For example, to run getsite.pl as a cron
job once an hour, enter the following:
1 * * * * /opt/memex/autoweb/bin/getsite.pl
If you used the auto-installer to install AutoWeb, the following entries are added to the cron
tab of the Memex administrator user:
# AutoWeb Run HIGH priority sites every hour
0 * * * * /opt/memex/autoweb/bin/getsite.pl HIGH
# AutoWeb Run MEDIUM priority sites every day
0 0 * * * /opt/memex/autoweb/bin/getsite.pl MEDIUM
# AutoWeb Run LOW priority sites every week
0 0 * * 1 /opt/memex/autoweb/bin/getsite.pl LOW
The combination of configuration records and the getsite.pl script, run as a cron job, allows
you to monitor the specified Web sites.
Note The auto-installer for AutoWeb adds the above cron jobs to the cron tab of the
Memex administrator user. In the unlikely event that you run the auto-installer more
than once (for example, if you delete installed files and then run the auto-installer
again), a duplicate set of cron jobs will be added to the cron tab. So, if you run the
auto-installer more than once, you must edit the cron tab and remove the duplicate
entries.
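Duplicate entries can be removed by hand in an editor, or filtered out with a short command. The following is a minimal sketch, assuming that the duplicates are exact repeats of earlier lines. The function name dedupe_cron_lines and the sample text are hypothetical. The demonstration works on sample text only and does not change any cron tab.

```shell
# Remove repeated lines from a cron tab listing read on standard input.
# The first occurrence of each line is kept, and blank lines are always kept.
dedupe_cron_lines() {
    awk 'NF == 0 || !seen[$0]++'
}

# Demonstration on sample text; nothing is installed.
sample='# AutoWeb Run HIGH priority sites every hour
0 * * * * /opt/memex/autoweb/bin/getsite.pl HIGH
# AutoWeb Run HIGH priority sites every hour
0 * * * * /opt/memex/autoweb/bin/getsite.pl HIGH'
cleaned=$(printf '%s\n' "$sample" | dedupe_cron_lines)
printf '%s\n' "$cleaned"

# To apply this to a real cron tab, review the filtered output before installing it:
#   crontab -l | dedupe_cron_lines > /tmp/crontab.new
#   crontab /tmp/crontab.new
```

Writing the filtered listing to a file first, rather than piping it straight back into crontab, gives you a chance to check the result as the Memex administrator user before it replaces the existing cron tab.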
Enter values for the Name, URL and Database fields to specify what you want to index and
where you want to store the indexed Web page data.
Enter values for the other fields, as required. These fields are described in the table on page 35.
Click Append to save the new record.
Enter values for the Keywords, SiteToIndex and Database fields to specify what you want
to index and where you want to store the indexed Web page data.
Enter values for the other fields, as required. These fields are described in the table below.
Save the new record.
Field MA: SiteToIndex
Details: Enter the full URL of the Web site you want to index. If you enter the URL of a Web
site without specifying a particular Web page (for example, http://www.yourcompany.com),
AutoWeb uses the home page of the site as the start page from which to index. You can index
an area within a Web site by specifying a particular page on a site (for example,
http://www.yourcompany.com/personnel/vacancies.html).

Field MP: Indexed
Field MA: Index
Details: This field allows indexing to be temporarily turned off by setting the field value to
NO. To resume indexing, set the value to YES. The default value is YES, so Web sites for
records with no value in this field (such as records from upgraded versions of AutoWeb) are
indexed.

Field MP: Database
Field MA: Database
Details: This is the name of the database to which index records for the Web site are saved.
The value is the name of the database as it appears on the file system, within the
autoweb/databases directory. The webarchive database is the default database created for
saving new index records.

Field MP: Priority
Field MA: Priority
Details: The value in this field allows indexing to be performed at different frequencies. This
is achieved by running the getsite.pl script against a subset of records, based on the value of
this field (as shown in the cron tab listing on page 33). The AutoWeb auto-installer creates
three Priority options to choose from. Choose your priority depending on how often you
want the site to be indexed and updated.
The frequency of updates is defined as follows:
• HIGH priority sites are indexed every hour
• MEDIUM priority sites are indexed every day
• LOW priority sites are indexed every week
Note: These frequencies are defined in the Memex administrator user's cron tab. See
Monitoring Web sites on page 33 for more details.

Field MP: Options
Field MA: Crawler Options
Details: Use this field to pass specific options to the HTTrack Website Copier software.
HTTrack is a third-party tool used by AutoWeb to copy Web pages. By specifying options
you can overrule many aspects of AutoWeb's default behaviour.
For full details of the many options for HTTrack, see the online User's Guide at:
http://www.httrack.com/html/fcguide.html
The option that you are most likely to want to specify is the link depth. AutoWeb's default
link depth is 2. This means that you will index all the pages that are linked to from the
specified start page (e.g. the home page of a Web site) plus all the pages that are linked to
from those primary-link pages. On a large Web site, with pages that each contain many links,
a link depth of 2 could result in hundreds of pages being indexed, and you may, therefore,
want to reduce the link depth. On a small Web site, however, you might want to increase the
link depth to 3 or 4.
The option for setting link depth is:
-%eN
Where N is an integer, typically between 0 and 4.
Notes:
• You must be extremely careful when specifying options. If you enter invalid options, or
the wrong option for the behaviour you intended, it can result in nothing being indexed,
unexpected indexing results, or everything on the entire domain being indexed.
• If you do not set a value here, the link depth defaults to 2.
• Setting a high link depth value for a large Web site can quickly result in you using up a
great deal of available disk space.
• By default, AutoWeb does not index pages that are located outside the domain on which
the start page is located. This helps to restrict indexing to a single Web site. You can bypass
this restriction by using the -e option. However, you should use this option with extreme
caution as it can easily result in you indexing a vast number of pages from the internet at
large.
• Link depth, by default, only extends to pages on or below the current directory level. For
example, if you index http://www.memex.co.uk/AboutMemex/index.php with a link depth
of 2, AutoWeb will index pages such as
http://www.memex.co.uk/AboutMemex/Awards/index.php, as this page is located in a
directory below the start page, but it will not index http://www.memex.co.uk/index.php,
which is in a directory above the start page. You can use the -B option to allow AutoWeb to
index up the directory structure as well as down it.
• HTTrack Website Copier is open source, third-party software. Memex is not responsible
for any of the content on the www.httrack.com Web site.

Field MP: Notes
Field MA: Notes
Details: You can enter any text about the site or this particular record here for your own
reference.
Stopping getsite.pl
If you have started getsite.pl and want to stop it, you must do so manually by killing its
process and any httrack processes.
To kill any getsite.pl and httrack processes:
1. As root or the Memex administrator user, open a shell console.
2. Type the following command:
ps -eo pid,args|grep autoweb
This lists the currently running processes whose details mention autoweb.
For example:
1545 grep autoweb
3197 /opt/memex/autoweb/bin/httrack -V /opt/memex/autoweb/bin/addtomemex
5371 /usr/contrib/perl -I/opt/memex/autoweb/perlmodules /opt/memex/autow
5513 sh -c /opt/memex/autoweb/bin/httrack -V '/opt/memex/autoweb/bin/add
3. Use the kill command with the relevant process ID number to stop each of the listed
processes, apart from the one mentioning grep, which simply reports the search you
ran.
For example:
kill 3197
kill 5371
kill 5513
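Picking the process IDs out of the listing can also be scripted. The following is a minimal sketch, assuming the two-column "pid args" layout produced by the ps command in step 2. The function name autoweb_pids is hypothetical, and the demonstration runs against sample text only, so no process is stopped.

```shell
# Print the process IDs from "pid args" lines whose command mentions autoweb,
# skipping the grep process. The pattern is written as [a]utoweb so that the
# awk process running this filter does not match its own command line.
autoweb_pids() {
    awk '/[a]utoweb/ && $2 != "grep" { print $1 }'
}

# Demonstration on a sample listing; nothing is killed.
sample_listing=' 1545 grep autoweb
 3197 /opt/memex/autoweb/bin/httrack -V /opt/memex/autoweb/bin/addtomemex
 5371 /usr/contrib/perl -I/opt/memex/autoweb/perlmodules /opt/memex/autow
 5513 sh -c /opt/memex/autoweb/bin/httrack -V /opt/memex/autoweb/bin/add'
pids=$(printf '%s\n' "$sample_listing" | autoweb_pids | xargs echo)
echo "Would stop: $pids"

# On a live system the same filter could feed kill, for example:
#   ps -eo pid,args | autoweb_pids | xargs kill
```

Printing the list first, as the demonstration does, lets you check that only getsite.pl and httrack processes are selected before you pass the IDs to kill.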
Appendix A
Known limitations
AutoWeb contains the following limitations:
• If a page contains any cross-domain frames, the index selection and index page buttons
will not work. For more information, see the Microsoft web site:
http://msdn.microsoft.com/library/default.asp?url=/workshop/author/om/xframe_scripting_security.asp
• AutoWeb will not index URLs that are redirected. For example, if you are in the UK
and you browse to www.memex.com you are redirected to www.memex.co.uk. As a
result you cannot use AutoWeb to index http://www.memex.com. The workaround
is to index a specific page below the redirected domain, for example,
http://www.memex.com/AboutMemex/
• If a user attempts to index a page that has cross-domain frames, the following error
message is displayed:
Browser security restrictions prevent you from indexing this page
• When AutoWeb mirrors a Web page it does not automatically mirror documents linked
to from that page. The depth of mirroring depends on the options specified in the
configuration record. As a consequence, style sheets used by the page, or images that
appear on the page, may not be mirrored.
• Memex strongly advises that you change the Internet security zone of the mirror to
disable scripting. As files copied to the local mirror are on your local Intranet, they may
have more security rights than is healthy. Select Tools > Internet Options > Security >
Restricted Sites, click Sites, and add your mirror domain to the list.
• When indexing Web sites using getsite.pl, images are not time-stamped. This
means that if a Web page contains an image that changes (but keeps the same name),
the old copy of the image will be overwritten. As a result, the earlier version of the page
will reference the newer version of the image.
Appendix B
Troubleshooting
If a Web site is not indexed or is not indexed in the way you expected:
• Check the known limitations listed in Appendix A.
• Make sure you are aware of the default indexing behaviour of AutoWeb and the
various HTTrack options.
See page 41 for a list of the default options and the online User Guide for HTTrack
Website Copier at http://www.httrack.com/html/fcguide.html for a complete list of
available options.
• Check the messages in the log file. The path and name of this file are given as the value
of the logfile parameter in the spider.cfg configuration file
(/opt/memex/autoweb/spider.cfg).
For example: /opt/memex/logs/crawlerlog.txt
If you get the message "Already Running Resource temporarily unavailable" when you
run the getsite.pl script, it indicates that the script has not finished indexing pages. This
may be because the configuration records are causing it to index more pages than you
had expected, or require. If this happens you should either wait for the script to
complete, or kill the process (as described on page 37), and then check the configuration
records before running getsite.pl again.
If the getsite.pl script runs more frequently than expected, check the entries in the cron
tab for the Memex administrator. The auto-installer for AutoWeb adds cron jobs for
getsite.pl to the cron tab of the Memex administrator user. If the auto-installer was run
more than once, the cron tab will contain duplicate cron jobs, which must be removed
by editing the cron tab.
Appendix C
HTTrack options
HTTrack Website Copier is open source software that is used to mirror Web pages. Memex
has altered the software slightly for use with AutoWeb.
Note For more information about HTTrack, visit: http://www.httrack.com/ and
http://www.httrack.com/html/fcguide.html.
This table lists the options that AutoWeb uses by default.
Option                                 Description
-n                                     Get non-HTML files near an HTML file
-%e2                                   Sets the external link depth to 2
-A32000                                Sets the maximum transfer rate in bytes/second
-I0                                    Don't make an index page
-Qq                                    No log and no questions
--assume cfm=text/html,php=text/html   Assume that a type (cfm, php) is always linked with a MIME type
-X0                                    Do not purge old files after update
-%F ""                                 Do not put a footer into the HTML pages
-%P0                                   Do not do extended parsing
-C0                                    Do not use a cache
-%Q                                    Do not follow any hyperlinks from the page. This option has been added to HTTrack by Memex.
-d                                     Stay on the same principal domain
This table lists other options that you can use, if necessary. To use either option, add it to the
opts parameter in the spider.cfg file. If no option is set, the default behaviour is to follow the
rules in robots.txt. See Setting up the AutoWeb configuration file on page 15.
Option   Description
-s0      When retrieving Web pages, do not follow the rules specified in robots.txt on the remote web server.
-s2      Follow all of the robots.txt rules with the exception of Disallow: / as this will prevent the software from retrieving any pages from a Web site.
Appendix D
Upgrading to AutoWeb 1.3
If you are currently using AutoWeb 1.0 or 1.1, you must upgrade to version 1.3 before you can
upgrade to version 2.0. Once you have a 1.3 system, you can upgrade to 2.0 by following the
instructions on page 18.
Upgrading from AutoWeb 1.0 or 1.1 to AutoWeb 1.3 is a two-stage process. First, you must
back up your previous AutoWeb setup; then you need to install AutoWeb 1.3.
Important You will need the installation package for version 1.3 of AutoWeb to complete
this procedure.
Backing up your previous AutoWeb setup
Before beginning the upgrade, you should back up your existing AutoWeb configuration and
databases. If the upgrade process encounters any problems, you can then revert to your
known, valid setup.
After the backup is complete, shut down the existing MIE and move the AutoWeb directories
aside. For example, if you installed your previous version of AutoWeb in
/opt/memex/autoweb you should move this whole directory to /opt/memex/autowebold.
Installing AutoWeb 1.3
After backing up your existing AutoWeb setup, you must perform a new, clean installation of
AutoWeb 1.3.
Note You must install AutoWeb into the same directory as your previous version. For
example: /opt/memex/autoweb.
If you are using this product with Memex Patriarch, it is essential that you read Chapter 3,
Using AutoWeb with Memex Patriarch, on page 24. You must perform all the steps detailed there
before you proceed with the conversion.
Converting your AutoWeb data
After installing AutoWeb 1.3, you must run a conversion script to convert the data from your
previous setup and create any new databases that may be required.
Setting up the conversion script
The conversion script reads a configuration file, convert.conf, which is stored in the bin
directory of the new AutoWeb installation. This file specifies the details of the AutoWeb
databases that will be converted.
Before running the conversion script, you must set the following options to reflect your
AutoWeb setup:
Option         Details
MIEDecodeDir   The path to the MIE installation used by the previous version of AutoWeb
MIEDir         The path to the new MIE installation
MIEPort        The network port that the new MIE is listening on
OldAutoWeb     The path to the previous AutoWeb setup
NewAutoWeb     The path to the new AutoWeb installation
IMBase         The installation directory for Memex Patriarch (if installed)
TempDir        A directory to use for storing temporary files
Verbosity      How detailed the AutoWeb output will be:
               0  basic output
               1  tracks processed databases
               2  detailed output