
,"^",._.,"^",._.,"^",._.,"^",.
~WhatYouSeeIsWhatYouChose~WhatYouSeeIsWhatYouChose~WhatYouSeeIsWhatYouChose~

                          POWERBROWSING v1.2.1 (en)
                    +mala, 20050318 - malattia(at)gmx(dot)net

~WhatYouSeeIsWhatYouChose~WhatYouSeeIsWhatYouChose~WhatYouSeeIsWhatYouChose~
,"^",._.,"^",._.,"^",._.,"^",.

1. INTRO
   1.1 How are you browsing now?
   1.2 How does it work, instead?
   1.3 Notes
2. TECHNOLOGIES
   2.1 Why HTTP?
   2.2 Why HTML?
3. PB TECHNIQUES: tools and basics
   3.1 Alternative browsers
   3.2 Leechers and Teleporters
   3.3 Spiders and scrapers
   3.4 Proxy-like software
4. PB TECHNIQUES: advanced
   4.1 Learn to search
   4.2 Experiments with curl
   4.3 Wget and lynx one-liners
   4.4 Fight Against Flash
5. BOT BASICS
   5.1 Detecting Web Patterns
   5.2 Website navigation with bots
   5.3 Data Extraction
6. PERL POWERBROWSING TOOLS
   6.1 Why Perl?
   6.2 Perl Packages
   6.3 LWP::Simple
   6.4 LWP::UserAgent
7. SHARE YOUR DATA
   7.1 Web and RSS
   7.2 Mail
   7.3 Net::Blogger
   7.4 TWO
8. Examples

This paper is dedicated to Scott R. Lemmon, Proxomitron's author. What
you've started will never stop, and your ideas will be forever alive,
around the Net and inside our minds.

A big THANK YOU to Andreageddon, who helped me with the English translation
when I didn't have enough time to do it :)

=============================================================================
1. INTRO

What is PowerBrowsing? The text you can read around the title is a good
explanation of this word: PowerBrowsing means browsing the Web seeing only
what you chose to see. Even if this might seem an easy thing to do, it is
not, and it will become harder and harder in the future... unless, of
course, you become a "PowerBrowser".
This text tries to explain how the Web works now and how you can
PowerBrowse, using ready-made tools or creating new ones which fit your
needs.

1.1 How are you browsing now?

How do most of you browse now? Well, you probably use the browser which is
installed by default on your system which, most of the time, means Internet
Explorer. The chance to customize the way you see Web pages is limited by
the options your browser allows you to change, so you probably download all
the images inside a website and all the active contents like Flash and Java
applets, you see more and more advertisements inside Web pages and you have
to close tons of popups when you visit some websites.
I'm sure that some of you are already getting angry at this generalization,
because they use another browser... hey, maybe another operating system
too! Well, let me try to guess anyway: despite this, you usually follow the
links your browser (whatever it is) shows you, you see pages inside windows
chosen by it and you access only the URLs it lets you visit. In some cases,
to get the information you are interested in, you have to follow some fixed
paths inside a website (like in autogrills or supermarkets... and every
website, as a supermarket, wants you to see all of its special offers). And
last, but not least, you always download much more HTML code than you
really need: maybe you don't see it, but be sure your modem notices it!
Well, anyway, why bother? Maybe it's not exactly what you wanted, but
modems are becoming faster and faster, and after all this is what you're
given and you can't change it much.

1.2 How does it work, instead?

Hey, wake up! I've got a piece of news for you: a computer is not a TV!
It's supposed to do what _you_ ask, not what _others_ want it to do. Also,
things are not always as they look: before the end of this text, you will
realize that most of the time you're able to see what a browser doesn't
directly show you, and to hide what it shows you by default.
Now, think about downloading one tenth of the data you usually download,
skipping all the advertisements, avoiding popups, keeping interesting data
(and only that) on your computer and accessing it while you're offline, in
a custom, easier and more effective way. Think that, once that information
is on your hard disk, you can write programs which work on it to produce
new, even more interesting information. Finally, think about the chance of
making all these data available to everyone, maybe in an automatic way.
All of this is what I call PowerBrowsing.

1.3 Notes

Well, before you read this text I think I need to write some notes about
it. First of all, this paper was born as a lecture I gave in Genova
(Italy) in April, 2004. The slides I prepared were then adapted to become
a full text, but of course they still have some characteristics of the
original speech you might not expect.
First of all, the speech was intended for a mixed audience, and so is this
text: this means that, while some of you might find some parts of this
tute interesting and others too hard, some others might become really
bored before they find something they're interested in, and other ones
might find nothing which deserves to be remembered here. My suggestion to
everybody is to skip all the parts you don't like and, in case nothing is
left, no problem: just think you've surely wasted less time reading this
text than me writing all of it.
Finally, I think I could not have written this paper without the book
"Spidering Hacks", by Tara Calishain and Kevin Hemenway (aka Morbus Iff),
published by O'Reilly, which gave me many interesting ideas and which
everybody who wants to create Web bots should read (and, why not, maybe
BUY too, if you think like me that the authors deserve it). Among the many
interesting pieces of info you can find inside this book, there are some
nice notes about bot netiquette everyone should read (and follow).

=============================================================================
2. TECHNOLOGIES

To understand what happens inside your PC when you download a Web page, you
should know the technologies the Web is based on. Inside this text I'll
take HTTP and HTML basics for granted; anyway, I'll try to explain things
while I'm writing about them. Since, of course, most of what I'll write
will be hard to understand anyway, here are some links to websites you
might find useful:

  http://www.w3.org/MarkUp/     (everything you need to know about HTML)
  http://www.w3.org/Protocols/  (everything you need to know about HTTP)

2.1 Why HTTP?

To tell you the truth, you won't need to know all the details regarding
this protocol. Anyway, in the next sections of this text you will read
something about HTTP-related concepts, and knowing their meaning in
advance will help you much and let you understand what your Perl bots will
be able to do. Moreover, knowing what kind of data your computer will
exchange with the servers it connects to will help you to create more
stable and more secure bots.

GET and POST

GET and POST "methods" are the first two terms we will learn: they
represent the two different ways your browser uses to ask servers for
data. Without delving too deeply into the details, the main features of
the two methods are the following ones:

- GET can be used both to ask for a static page and to send parameters to
  a dynamically generated one (CGI, PHP and so on). The typical URL of a
  GET looks like

    http://www.web.site/page.php?parameter1=value1&parameter2=value2

  (you have probably read this kind of URL before, inside the address bar
  of your browser). The amount of data you can send with a GET is quite
  limited; also, you should keep in mind that the params you send with a
  GET are usually saved inside Web server logs (too).

- POST is used only when you want to send data to the Web server. The
  amount of bytes you can send is higher than with GET, and inside the
  server's logs you will be able to find only the URL you POST to, and
  not the data you have sent. This is quite important if, for example,
  you want to create some bots which automatically authenticate and log
  inside a website: in this case POST is better than GET, but keep in
  mind that, if the connection is in clear, your login and pass won't be
  safe anyway (as they aren't now with your browser).

Referer

Among the many parameters that are usually sent together inside a GET or
POST request, whenever you reach a page following a link browsers usually
send your referer to the server, that is the URL of the website you're
coming from. This piece of information can be used by the server itself
to check if you're coming from one of its pages (and not, for example,
from your hard disk: countless hacks can be made just editing a web page
locally!), or to create statistics about how many users come from some
particular websites.

User-Agent

The User-Agent is the software which connects to a server and
communicates with it, asking for the pages you've chosen and receiving
them. In practice, anyway, most of the time this name is used for
something else: it's the name of the string used by the app (which could
be a browser or any other piece of software) to identify itself with the
server.
Sometimes, this very string is used by the server to allow some programs
to access a page and to keep away others: this happens, for instance,
with some "optimized for Internet Explorer" websites. The most
intelligent browsers (if you ask, NO, Internet Explorer is not one of
them) allow you to send different User-Agent strings (custom or standard,
ready-made ones), and if you plan to write some serious piece of Web
software you should add this feature too.
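To see the difference in miniature, here is a little sketch you can run
from the shell (the example.com URLs are placeholders, not real sites from
this paper): a tiny helper builds the query string a GET would carry, and
the comments show how the same parameters travel with curl.

```shell
# Build a GET query string from name=value pairs.
build_query() {
    out=""
    for pair in "$@"; do
        out="${out:+$out&}$pair"     # join pairs with "&"
    done
    printf '%s' "$out"
}

QUERY=$(build_query "parameter1=value1" "parameter2=value2")
echo "GET http://example.com/page.php?$QUERY"

# The same request with curl (GET is its default method):
#   curl "http://example.com/page.php?$QUERY"
# A POST moves the data out of the URL and into the request body:
#   curl -d "$QUERY" http://example.com/page.php
# And this is how an app can present a custom Referer and User-Agent:
#   curl -e "http://example.com/index.html" -A "MyBot/0.1" http://example.com/page.php
```

Note how with POST the parameters never appear in the URL, which is
exactly why only the bare URL ends up in the server's logs.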
Cookie

Cookies are text files which are automatically saved by your browser on
your hard disk. They contain different kinds of information, most of the
time related to your previous connections to a website or your
authentication. For this reason, cookies are often disliked by the ones
who want to defend their privacy. However, cookies are used almost
everywhere now and some websites don't even work if the application which
connects there doesn't support them.

Proxy

Proxies are programs which forward your Web apps' requests to the desired
servers and return their answers to the clients. Their job can be easily
described with this scheme:

  +--------+   request   +-------+   request   +--------+
  |        | ----------> |       | ----------> |        |
  | CLIENT |             | PROXY |             | SERVER |
  |        | <---------- |       | <---------- |        |
  +--------+   answer    +-------+   answer    +--------+

Proxies are very useful for many different reasons:

- the client might not be able to access servers, but it might be
  authorized to access the proxy: in this case, client apps can reach the
  server anyway, passing their requests through the proxy
- some proxies have a cache, inside which the most frequently requested
  files are saved. So, if the connection between the client and the proxy
  is much faster than the one between client and server, then you might
  be able to download cached files much faster
- some proxies don't tell the server where requests come from, so the
  clients are able to connect anonymously to the Net

Later, we'll see how you can use proxies to get many more, and even more
interesting, advantages.
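For command-line clients, going through a proxy is just a matter of one
option or one environment variable. A minimal sketch (the proxy address
and target URL below are placeholders):

```shell
# curl takes the proxy directly on the command line:
#   curl -x http://proxy.example.com:8080 http://web.site.url/page.html

# wget and many other tools read it from the environment instead:
export http_proxy="http://proxy.example.com:8080"
#   wget http://web.site.url/page.html

echo "proxy in use: $http_proxy"
```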

2.2 Why HTML?

Because everything your browser shows you has its own HTML source, created
by hand by someone, generated by some program or by a script. For this
reason you shouldn't know only HTML basics, but you should be able to
understand if the code has been automatically generated or not, so you can
use multiple tag occurrences to split a page in sections and extract its
contents.

Since forms are always present inside dynamic websites, you should spend
some time understanding how this technology works. To tell you the truth,
it's not such hard work; anyway, some experience in this field will help
you not only to easily understand the syntax and the meaning of what you
see, but also to understand how a whole website works. And without any
particularly intrusive technique, but only with forms and HTML knowledge
(and well, yes, some brain too), you'll be able to create bots which can
do much more than you can imagine now.

=============================================================================
3. PB TECHNIQUES: tools and basics

Around the Net you can find lots of ready-made PowerBrowsing tools for
free (or, if you prefer, sold for a lot of money), for any operating
system. Describing all of them in detail is impossible, so we'll try to
group them in categories and describe their main characteristics.

3.1 Alternative browsers

Alternative browsers are, basically, all the ones which are not Internet
Explorer. Even if they're not the definitive solution to any problem,
choosing one is the first step you can take to free your system from
unrequested contents, such as advertising banners, flash menus and popup
windows.
Opera, for instance, allows you to toggle image loading in a Web page (or
to disable just non-cached images, such as banners) with a simple click,
while the same operation with IE requires you to browse for a while inside
all the configuration menus. Opera also allows you to quickly toggle
Javascript or other dynamic contents and, in the same way, you can choose
a custom view of Web pages, with fonts and colors chosen by you instead of
the ones decided by the page creators.
Among all the browsers with a GUI, I've heard very good comments about
Firebird. I haven't tried it enough to write about it here, so I'm waiting
for much feedback and many "PowerBrowsing hints" about it. Anyway, without
going far from this, even Mozilla is, for many reasons, a better choice
than MSIE, and compared to many newer browsers it also has a greater
compatibility with many websites (which you pay for with a heavier and
more bloated app).
If, instead, what you want is only the text of a Web page, you might
consider the idea of using a textual browser like lynx or links or w3m:
the speed of page downloads, without all the heavy contents, will amaze
you. Moreover, as you'll be able to see later, loading the contents of a
page from the command line gives you the chance to operate with Web
information in a more effective, even if maybe less conventional, way.

3.2 Leechers and Teleporters

The programs pertaining to this category allow you to "leech" or
"teleport" contents from the Web. Both of them usually download lots of
files on your hard disk: the first ones are simpler, getting every file of
some kind or all the contents of a directory; the second ones (whose name
derives from the old win application Teleport Pro) allow you to get the
contents of a website, automatically following its links and trying to
replicate its inner structure on your disk (that is, "mirroring" it).
Even in this case, there are so many applications that it's impossible not
only to describe, but even to know all of them. Personally, the ones I
liked most during the years were Teleport Pro and GetRight under Windows,
and wget, curl and lynx under Linux. In fact, the distinction between
Windows and Linux apps is not so sharp now, as you can often run programs
of one OS under the other: the result is that now the tools I use most
frequently are wget, lynx and GetRight (still an old version, but very
useful for its "GetRight Browser", which connects to a Web page and shows
all the linked files, allowing you to choose and download only the ones
you're really interested in).
You've probably already noticed that, even if lynx is a browser, it fell
into this category too. This happened because, thanks to its -source and
-dump switches, and since its output can be piped to other programs, it
can work as a leecher too. As you will see later (in section 4.3), using
it as a leecher you can download files and information in a very efficient
way.

3.3 Spiders and scrapers

Since, until some months ago, I didn't even know this kind of
categorization, I decided to quote (more or less) the two definitions I
found in "Spidering Hacks":

- "spiders" are programs which automatically retrieve data from the Web.
  Usually, they're smarter than teleporters (they don't just follow all
  the links they find, but they can choose them depending on some
  algorithm) and they usually download the whole content of the pages
  they find
- "scrapers" are programs which extract only some specific parts out of
  Web pages. In fact, they often have to download whole pages anyway;
  however, they can save on your disk only the data you're really
  interested in and not all the pages' contents.
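The "scraper" idea is easy to see in miniature. The snippet below is a toy
scraping step run on a local sample page (so it needs no network): it
throws away the whole document and keeps a single field, the page title.

```shell
# A sample page, stored in a variable so that no download is needed:
PAGE='<html><head><title>Spidering Hacks</title></head><body>lots of content we do not care about</body></html>'

# The scraping step: keep only the piece of data we are interested in.
TITLE=$(printf '%s' "$PAGE" | sed 's/.*<title>\(.*\)<\/title>.*/\1/')
echo "$TITLE"    # prints: Spidering Hacks
```

A real scraper would of course get the page first (with lynx -source,
curl or wget) and then apply this kind of extraction, usually with
something stronger than sed, as the Perl tools of section 6 will show.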

You probably won't find many ready-to-use spiders and scrapers around the
Web so easily; however, there are some interesting ones: for instance,
liberopop (with all its variants) is a program which allows you to
download with your email client the messages you receive in your mailbox,
which could otherwise be browsed only on the Web. Of course, the problem
of providers which first offer free mail and then close their POP3
servers is not new, and different solutions have been found, during the
years, to fight this trend: just give a look at the "Perl@usa.net" paper
by Blue (which you can find on any fravia's mirror, in the bot section,
botstart.htm), dated 1999! Anyway, I liked liberopop's structure

  +--------+  POP3 req.  +----------+  HTTP req.  +--------+
  |        | ----------> |   POP3   | ----------> |  WEB   |
  | CLIENT |             | EMULATOR |             | SERVER |
  |        | <---------- |          | <---------- |        |
  +--------+  POP3 ans.  +----------+  HTTP ans.  +--------+

which not only resembles a proxy very much, but is also almost identical
to the one I designed for an old app of mine (ANO - Another Nonworking
Offline forum reader), which allowed you to read web forums with your
email client. If you're interested in this kind of apps, you might like
one of my latest projects, called TWO (The Working Offline forum reader),
a scraper I'll describe in a little more detail in section 7.4.

3.4 Proxy-like software

The programs which fall into this category use the same architecture as
proxy servers to get results you don't usually expect from an application
of this kind. Thanks to their intermediate position between client and
server (they are often called "middlewares"), they can transparently work
on data travelling in both directions: for instance, they can filter
information coming from the server and take only what you're interested
in, or change pages on the fly; in the same way, they can read the data
you send to the server and use them later, to automatically authenticate,
log and browse into a website.
An example of the first kind of proxy (that is, the one which filters
data coming from the server) is Proxomitron, a small but great Windows
application which allows you to filter Web pages, cutting away everything
you don't like (banners, popups, javascript, spyware) and completely
changing their look. Trying to create a platform-independent version of
this program, a group of reversers is working on a project, called
Philtron, which aims to create a PHP, Proxomitron-compatible application
with even more functions. You can find more about this project at these
URLs:

  http://philtron.sf.net                       (main project page)
  http://fravia.2113.ch/phplab/mbs.php3        (PHP Labs)
  http://fravia.2113.ch/phplab/mbs.php3/mb001  (Seeker's message board)

For what concerns the second proxy type (the one which works on data you
send to the server), a good example is the Web Scraping Proxy: this Perl
app can "record" everything you do inside a website, then it
automatically creates the source code needed to make a Perl bot which
will mimic all your actions. To know something more about this program,
check the website

  http://www.research.att.com/~hpk/wsp

or give a look at the "hack" (number 30) inside Spidering Hacks.
Note: I've recently found another Perl package which should do the same
thing. It's called HTTP::Recorder and you can easily find it at CPAN
(try the search engine at http://search.cpan.org).

=============================================================================
4. PB TECHNIQUES: advanced

More advanced PowerBrowsing techniques don't just use a single program, no
matter how advanced its options are: they usually take advantage of both
the knowledge acquired by users and the functions provided by one or more
tools, which can be ready-made ones or created ad hoc. This way you get
new, more powerful and effective tools. Moreover, reading about more and
more complex experiments, you'll notice how we'll move from the simple
file download to a more general (and, IMO, much more interesting)
_information retrieval_.
Of course, the PowerBrowser's skills are fundamental here, and the higher
your experience is, the better the results will be. Anyway, even with the
few, simple suggestions which follow you'll be able to find and download
much more easily everything you like.
Speaking about examples, keep in mind that they are not supposed to be
very useful in practice. Moreover, given the speed with which Web
information changes, some of them might not even work anymore when you
read this document. Don't worry too much about this: read them, understand
them, try to change them, and be sure that you'll have something more than
some lines of code.

4.1 Learn to search

As you have probably understood by now, if you want to see only what
you're interested in you first have to _find_ what you're interested in.
On the other side, in some cases you might already know where that file
you wanted to download resides, but for some reason you don't want to
connect to the website it's stored in (for instance, because you have to
pay to do that): even in this case, knowing how to search will help you
to find alternative sites from which you'll be able to get the same file.
A first suggestion I'd give to anyone is to visit the good Searchlores
site (http://searchlores.org). Inside it you will be able to find many
tutorials about Web research and technologies, inside a place which is
completely free from advertisements, banners and commercial stuff. In
particular, I suggest you give a look at "search webbits", ad hoc search
strings for some file or information categories.
Among these, the "index of" trick is one of the best ones, even if some
commercial websites are already trying to use it to attract searchers to
their pages. In practice, you can use it to restrict the Web search to
those directories which are open on the Web: these ones are nothing more
than long file lists, and always have at their beginning the "Index of
<dirname>" header and a link to their "parent directory".
Now, if for instance you want to download ringtones without paying a
cent, why do you have to get lost among dialers and commercial websites
when you can download all the songs you like in midi format? To find the
websites which share them freely, you just need to feed google with the
following string:

  "index of" "parent directory" ringtones .mid

Another example is about those funny videos you download when you don't
know how to spend time inside your office. With the string

  "index of" "parent directory" fun .mpg

(change .mpg to your favorite video format) you can find all the videos
you like and, if you're lucky, even some websites which are regularly
updated with new funny resources.
If, instead, you want to try the same trick with mp3s, you'll probably
find lots of wrong results, which link you to commercial websites. This
happens because, as I told you before, once you start using a trick
frequently "on this side", it becomes used "on the other side" (the
choice of which side is the dark one is left as an exercise to the
reader) to attract people where they don't want to go. Fortunately,
regardless of how many techniques they'll try to use to trick us, we will
always be one step ahead of them ;)
Quite banally, if you want to take away most of the fake results from
your search, you can try to cut away the classic web page extensions. If
you see you still get too many results, you can add some filters on
specific terms:

  "index of" "parent directory" .mp3 Iron Maiden -.html -.htm -faq

Here, for instance, we're searching for some Iron Maiden mp3s, cutting
away HTML pages (which are not interesting for this research) and the
results which contain the word "FAQ", because in some newsgroup's FAQ
somebody talked about both Iron Maiden and mp3s, and someone else had the
great idea of mirroring it almost everywhere.
If, finally, you even know the song's title, you can try to insert its
last word, joined to the file extension, or add a part of the title to
the search strings:

  "index of" "parent directory" Metallica frantic.mp3 -.html -.htm

Another suggestion I can give you about web searching is to use web
trackers. Many websites use them to have access stats to study: what many
don't know, anyway, is that many commercial trackers are open to anyone
and let you see, among many different stats, referrers too. For instance,
try to give a look at my web tracker:

  http://extremetracking.com/open?login=alam

As I've told you previously, referrers are the URLs from which users came
when they hit a website, in this case the one which uses the web tracker.
Of course, there's a high chance that many referrer websites will be
devoted to the same topic: so, you'll just have to search on google for
the topic you're interested in and the name of a web tracker (or its
URL, or a string which uniquely identifies it) to start delving deep
inside a mine full of potentially interesting links.
A last suggestion, if you're searching for a particular file and
especially if it's a copyrighted one, is to give a look at peer-to-peer
channels first: the download will probably be slower than the one from a
website, but you'll have many more chances to find a movie or a book,
even if you have only a couple of words from its title... And, once you
have the file name, you might even decide to get back to normal Web
search.
If, while searching for a particular file on a p2p network, you happen to
find the names of groups which periodically release files of the same
kind (for instance, horror movies or comics or TV series or whatever
else), write them down: next time you'll have more chances to find what
you're searching for, using these names among your search strings. In the
same way, following (while paying attention, of course) the links you can
find inside the classic .nfo files, I happened to find some monothematic
communities, much more specialized and full of contents than any search
engine I used.

4.2 Experiments with curl

Inside some websites you might happen to find long lists of files of the
same type, whose filename always has the same prefix followed by an
incremental number. If the file list is visible (that is, browsable) from
the Web, with an HTML page, or because the directory where the files are
stored is accessible, the easiest method to download all the files is
always wget:

  wget -m -np http://web.site.url/directory/

If you want, you can specify the extension of the files you want to
download:

  wget -m -np -A <extension> http://web.site.url/directory/

Unfortunately, direct access to directories is often closed and many
sites don't give you a file index, but force users to download at least
as many pages (with useless images, banners and popups) as the files you
want to download. For this task, the most useful utility you can find is
curl.
Curl allows you to choose, through the command line, one or more URLs,
and to specify which parts of them change (putting them between curly
brackets) or which are the sequences of alphanumeric characters the
program has to try (writing them between square brackets). For instance,

  http://web.site.url/directory/file[1-100].txt
  http://web.site.url/directory/file[001-100].txt
  http://web.site.url/directory/file[a-z].txt
  http://web.site.url/directory/file[1-4]part{One,Two,Three}.txt

allow you to download all the files which begin the same way and continue
with, respectively,

- numbers from 1 to 100
- numbers from 1 to 100 (padded with zeroes to always have 3-digit
  numbers)
- letters from "a" to "z"
- numbers from 1 to 4, followed by "partOne", "partTwo", "partThree"

Curl has many more options than the ones described here. It also supports
various protocols (HTTP, HTTPS, FTP, GOPHER, DICT, TELNET, LDAP, FILE)
and can be used for file upload too (to know more about this, write "man
curl" inside your shell). However, it still has some limitations: for
instance, it still cannot efficiently manage filenames which contain
dates inside them. If you run this command

  curl -LO http://web.site.url/dailystrips/[2000-2004][01-12][01-31].gif

you will download all the images you're interested in, but you'll send
the server more requests than you need, trying for example to download
images dated February 30th or June 31st...
There are many techniques to solve this problem: among them, there are
some you will see inside the next section, which will allow you to
automatically download your favorite daily strips every day.
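One simple way around the calendar problem is to let date(1) do the
calendar math, so that impossible days like February 30th are never
turned into requests. A sketch (it assumes GNU date with its -d option;
the URL pattern in the comment is hypothetical):

```shell
# Print "days" consecutive dates in YYYYMMDD format, starting from "start".
# date(1) knows the calendar, so invalid days are simply never produced.
valid_dates() {
    start="$1"; days="$2"; i=0
    while [ "$i" -lt "$days" ]; do
        date -d "$start + $i day" +"%Y%m%d"
        i=$((i + 1))
    done
}

valid_dates 2004-02-27 4   # 2004 is a leap year: Feb 29th shows up, Feb 30th never does
```

Each line can then be fed to curl or wget in a loop, for instance:
for d in $(valid_dates 2000-01-01 1827); do
  wget "http://web.site.url/dailystrips/$d.gif"; done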

4.3 Wget and lynx one-liners

Inside this section you will have the chance to see some one-liners which
make use of wget and lynx. They are the result of an old project I named
"Browsing the Web from the command line", which had some contributors
inside the RET forum (http://www.reteam.org). The project has been
inactive for a long time, but since it's part of PowerBrowsing now you
are free to contribute, sending comments, requests or new experiments.
Thanks in advance :)
The first one-liner we'll see allows you to download daily strips from
the "kevinandkell.com" website (but it won't be much different for other
sites). I took it from http://www.rollmop.org/junk/oneliners.html, which
has a nice collection you might want to check (the whole website is quite
interesting indeed).

The one-liner gets the current date using the "date" command: to get more
information about it, run "date --help" at the prompt.

#============================================================================
wget http://www.kevinandkell.com/`date +"%Y"`/strips/kk`date +"%Y%m%d"`.gif
#============================================================================

Note how date is called twice: it's between backquotes (in this way, the
program output becomes part of the URL string) and it's passed the +""
param, inside which the desired date format is specified. The codes used
to specify the date format are the following ones (taken from the help):

  %Y   year (1970...)
  %m   month (01..12)
  %d   day of month (01..31)

So, if you run this command, for instance, on April 2nd, 2004, wget would
try to download

  http://www.kevinandkell.com/2004/strips/kk20040402.gif

One of the most interesting features of these commands is that you can
make them run automatically every time you power on your PC, or every day
if your computer is always on: this way, you'll have all the files you
want always ready on your hard disk. If among the websites you want to
periodically check there are some which are not updated daily, you can
change the behavior of your script: for instance, you can add some code
to tell you that on a particular day the script could not download any
file:

#============================================================================
if [ `date +"%w"` -lt 6 ] && \
   [ $((`date +"%w"` % 2)) = 1 ]
then wget http://www.goats.com/comix/`date +"%y%m"`/goats`date +"%y%m%d"`.png
else echo no goats today
fi
#============================================================================

In this script, the date command used with the "%w" parameter returns the
day of the week (from 0 to 6). If the day is less than 6 (that is, if
it's not Saturday) and if it's odd (that is, only if it's Monday,
Wednesday or Friday) then the image is downloaded; otherwise, the program
tells you that no pics are available. On the same day used for the
previous example, that is 2004/04/02, the program would download the file

  http://www.goats.com/comix/0404/goats040402.png

In both previous examples we used wget in the easiest way we could, that
is directly passing it the URL we wanted to download. However, this
program can automatically execute more advanced operations, like the
following example shows:

#============================================================================
wget -A mpg,mpeg,avi,asf -r -H -l2 -nd -t1 http://url.you.like
#============================================================================

Here, wget is used alone, but with many different switches. Their meaning
is the following:

  -A    comma-separated list of accepted extensions
  -r    recursive web suck
  -H    go to foreign hosts when recursive
  -l2   maximum recursion depth = 2
  -nd   don't create directories
  -t1   set number of retries to 1

Let's try to read the single descriptions and join them to understand
what this command really does: wget first connects to the specified URL,
then it recursively follows links (-r) to a maximum depth of 2 (-l2),
exiting the main website to connect to external servers when needed (-H);
then it saves, inside the current directory (-nd), all the files which
have mpg, mpeg, avi or asf extensions. Now, if you think about those
websites full of links to other sites which contain lots of free videos,
you'll understand why this command has been named... "wget-powered porn"!
A few lines ago we were speaking about link collections: what if we
wanted to extract all the links contained inside a page, not to follow
them but just to save them inside a text file? For this task we can use
lynx, joined with some other tools:

#============================================================================
lynx -dump http://www.reteam.org/links.html \
  | sed 's/^ *[0-9]*\.[^h]*//' \
  | grep '^http'

#============================================================================

Here we use three programs, piping the output of each one to the input of
the following one:

- lynx downloads the specified page and sends its dump to sed (keep in
  mind that lynx's "dump" contains, at the end of the text, the list of
  all the links - http or not - present inside a page)
- sed left-aligns all the http links and deletes all the other ones
- grep extracts just the lines which start with "http"

In this way, with just one line of commands, we've extracted all the
links inside a page, getting some new information without even opening a
browser. Of course, lynx helped us much, giving us a page containing all
the links, but things aren't always so easy: a more advanced operation
(but not, for this reason, much harder to accomplish) is the following.

#============================================================================
lynx -nolist -dump 'http://www.reteam.org/board/viewforum.php?f=3' \
  | grep -2 "Browsing the web" \
  | tail -1 \
  | awk '{print $1}'
#============================================================================

In this case, we used four programs:

- lynx connects to the specified URL, where there's a forum, and
  downloads a page containing the forum's thread list
- grep extracts the thread which contains the words "Browsing the web"
  with a 2-line "context" (that is, it takes 2 lines before and 2 lines
  after the line in which the string "Browsing the web" is found)
- tail gets the last line of the context, that is the second one after
  the subject line
- awk returns the first word inside this line

Of course, if you want a sequence of programs like this to be useful you
don't just have to understand how the single programs work, but you
should also know how the forum is organized and how, inside the Web page
sources, "static" and "dynamic" data are stored (the first ones never
change, while the other ones usually change, like a message sender or a
post subject). As I said previously, this is a skill you can master only
with experience: so, before giving you some more suggestions, I'll show
you another example and I'll explain it in more detail. Be ready for the
Fight Against Flash!

4.4 Fight Against Flash

Why do we have to fight against Flash? Flash is a good technology from
the communicative point of view, it allows you to obtain good graphic
effects without generating too heavy downloads, and it's compatible at
least with the most used browsers.
In fact, Flash is not such a big problem. Its programmers, instead, are a
HUGE problem when they use it the wrong way: if you have to create a
website only for promotional purposes, 100% flash, I can understand that
a text-only version doesn't make much sense; however, when inside a
website which should provide _information_ you find a flash menu without
alternative code to support less used browsers, this becomes a huge
problem. And if, on one side, flash is fortunately less trendy than it
was some time ago, you just have to search with google for files like
"menu.swf", "leftmenu.swf" or "navmenu.swf" to understand how many of
them are still around.
Flash menus are a problem for different reasons: first of all, I might be
interested in the rest of the website and I do not want to just leave the
website and switch off my PC; moreover, because of these menus not only
text browsers like lynx cannot access site contents, but also special
devices like braille bars might have problems. And, even if in my case I
might just open another browser, other people cannot choose, and so these
websites' contents are precluded to them. This is the reason why, in my
opinion, it's good to fight against flash now and, more generally,
against everything which is going to limit internet use in any way.
To tell you the truth, my fighting skills are not particularly
aggressive: I have simply tried to understand how flash files work, to
extract as much info as I could from them and finally to rebuild the
contents of the menus which weren't accessible. To do this, I've used
three different approaches, each one more difficult and, at the same
time, more effective than the previous ones. I still haven't found a
universal solution, but I'm sure that if you work on these examples
you'll be able to do something better than me.
The easiest way to extract information from an swf file is to show its
strings with the "strings" command. If the file we're working on is a
menu, we can suppose that links are what we're most interested in: we can
then run strings, asking it to open the swf file, and then redirect its
output to grep, which will extract only the strings containing "http":

  lynx -source http://www.reteam.org/top.swf | strings | grep http

This first approach already gives you some results, and with some menus
it works perfectly. However, it doesn't take into account many other link
types: for instance, for the websites where links are managed by some
javascript code, you'll have to change the string passed to grep:

  lynx -source http://www.mypetskeleton.com/mpsmain.swf | strings | grep http
  lynx -source http://www.mypetskeleton.com/mpsmain.swf | strings \
    | grep javascript

Working on strings you can get much other interesting info, but the
links problem still isn't solved: as you might have guessed by now, this
kind of approach is quite blind, trying to find something interesting
inside a lot of data which might or might not contain what we really
want. To solve this problem, I've given a look at various swf files,
trying to understand which format they use to save links inside them: in
all the flash files I've checked, every link was represented by the
sequence

  0x00 0x83 0xLEN 0x00 "string" 0x00

where LEN is the string length, and 0x is the prefix I used to show that
all the values are written in hexadecimal.
Once I found this, creating a small script which extracted all the links
with a regular expression was just a matter of seconds. If you don't
know what a regular expression is, run and study them! Otherwise, here
you are the script source:

#============================================================================
#!/usr/bin/perl
undef $/;    # enable slurp mode
$_ = <>;

# SYNTAX IS: 0x00 0x83 0xlen 0x00 "string" 0x00
while (/\x00\x83.\x00(.*?)\x00/gs) {
    print "$1\n";
}
#============================================================================

Yep, it's all there (you understood why you have to study regexps, didn't
you?). In fact, it's nothing more than a loop which says "while you find
byte sequences like the one I described before, extract the 'string' part
and show it". The usage of this script, supposing the swf file is somewhere
on the Web, is the following:

   lynx -source http://web.site.url/menuname.swf | perl flash.pl

For instance:

   lynx -source http://www.reteam.org/top.swf | perl flash.pl

Thanks to the script we created, we can now reconstruct the full list of
links the menu was pointing to; however, we can't retrieve the text of the
menu items (also because it's sometimes replaced by images). So, how can we
know in advance where the link we've extracted will take us?

The following script is a _lookahead_ link extractor: it does the dirty work
for us, following the links it finds inside the Flash file, extracting
the title from the Web pages it visits (if it finds any) and using it to
rebuild an HTML menu.

Flash Lookahead Link Extractor
#=============================================================================
#!/usr/bin/perl

use LWP::Simple;   # used to get linked pages
use URI::URL;      # used to absolutize URLs

sub rel2abs {
    my ($rel, $base) = @_;
    my $uri = URI->new_abs($rel, $base);
    return $uri->as_string;
}

$url   = $ARGV[0];
$flash = get($url) || die "Couldn't download $ARGV[0]";

# SYNTAX IS: 0x00 0x83 0xlen 0x00 "string" 0x00
while ($flash =~ /\x00\x83.\x00(.*?)\x00/gs) {
    my $nextitle;
    my $link     = rel2abs($1, $url);
    my $nextpage = get($link);

    if ($nextpage =~ /<title>(.*?)<\/title>/i) {
        $nextitle = $1;
    } else {
        $nextitle = $link;
    }
    print qq|<a href="$link">$nextitle</a><br>\n|;
}
#=============================================================================

This source is a little more complex than the previous one, but I assure you
there's not much more to know to understand it: in the meanwhile, just be
aware that the usage of the script from the shell is the following:

   perl flash2.pl http://url.you.like/menu.swf

Note that, in this case, we directly pass the URL to the script instead of
using lynx: in this way, the program will be able to use the base menu URL to
turn all the relative links it finds inside the flash file into absolute
ones. This operation is done by the "rel2abs" sub, which uses a ready-made
command from the URI::URL package. The other package I used is LWP::Simple, a
package you'll see in more detail later: for now, just know that it gives you
the useful "get" command, which just needs a URL to return the HTML source of
the matching Web page. Inside the script, this source code is saved inside
the $nextpage variable, which is matched against the regular expression to
extract the page title whenever one exists.

=============================================================================
5. Bot Basics

During our war against Flash we've seen the first example of a Perl robot (or
more simply bot): it browses inside a website for us, following the links it
has extracted from a flash menu, and it rebuilds the same menu in HTML,
adding to every link the corresponding Web page title.

But what identifies a bot, and what should one of these programs be able to
do? Well, despite many historical or rigorous definitions, a good Web bot
should at least provide the typical spider and scraper functions: that is, it
should be able to move inside a website (following links, filling forms,
authenticating itself and so on) and to download full Web pages or just parts
of them, extracting the most interesting information and trashing everything
else.

Of course a bot isn't able to do everything alone, such as recognizing "what
is interesting" or following some links instead of others. The robot, in
general, is a machine inside which the man does what he wants: so, even in
this case, we first need a human to visit the websites the bot should be
able to browse and to decide what it will consider interesting and what it
will have to ignore.

5.1 Detecting Web Patterns

Human intervention while creating a bot is fundamental: this is because the
machine isn't able to work directly on reality, but just works on a model,
built by man, which is based on a personal interpretation of reality. For
instance, while we can recognize, inside forum messages, a message subject
(that is, we immediately perceive the page _semantics_, regardless of how it
is coded), a bot can just understand that the subject is between some pieces
of HTML code (that is, it works on a _syntactical_ level, strictly dependent
on how the page is coded).

Of course, all the work which is being done on the "Semantic Web" should
allow machines to access the Web more easily. Also, the use of AI-powered
wrappers could give some good results... but this is another story: for now,
keep in mind that, for some time, you'll still have to do a good part of the
work.

What are the Web Patterns we named inside the title? They're nothing more
than particular schemas you can detect inside one or more Web pages'
contents, or inside the paths we're used to following when, clicking from one
link to another, we move inside a website. They're exactly what we need to
understand if we want to teach a bot to do what we usually do, that is

1) visiting a website and
2) extracting information from it

In the first case, the question we have to answer is: is there a way to make
a program follow some links automatically? Of course there is, and the proof
is all the software which can mirror a website. But is it also possible to
make a bot CHOOSE some links, and if so which ones?

In some cases the answer is easy: for instance, for "wget powered porn" we
just asked it to follow all the links and download all the videos found at a
depth of two links. This worked because we knew that the structure of the
websites we wanted to use wget with was the following one:

   Website containing      1     Website containing      2     Video
   links to other sites  ----->  links to videos       ----->  Files

In other cases the answer is less trivial: for instance, we might have to
download all the images from a gallery where every thumbnail has a link to
the matching image, and in every page there's a link to the following one,
but we don't know how many pages we have to follow. In this case, we should
tell the bot to collect all and only the files matching the desired images,
and to follow the links to the next pages until it finds them, or until it
finds some kind of terminator (for instance, a link to the first page of the
gallery).

And what if we wanted to save all the messages which reside in a forum? In
this case, we would probably have a structure which is very similar to a
gallery, but with a deeper hierarchy (Forum > Threads > Messages) and without
a file extension helping us to recognize the information we want.

As you can understand from these examples, being able to recognize some
navigational patterns might help you much while developing a bot. Of course,
as with advanced PowerBrowsing techniques, some experience is needed here
too. Even if they're implemented in different ways from one website to
another, navigational patterns are more or less always the same: this means
that the more experience you gain, the easier your job will be.

As I told you before, we don't just have navigational patterns: we can
also find some schemas occurring frequently inside page contents, in a single
website or even inside different ones. Being able to find these patterns
allows you to teach your bot how to extract, from each page, only the
information you're really interested in.

For instance, from a generic forum HTML page (that is, the one containing the
message) we'd probably want to keep only the sender, the subject, the date
and the message body. In the full page, instead, you can usually find images,
banners, HTML code for table generation and so on, until you get documents
which can even become 100 times bigger (it's true, I've counted the bytes!).
However, if we find that before and after the pieces of information we're
interested in there's always the same HTML code, then we can create wrappers
which can recognize and extract them.
If with old handmade websites we could just suppose, but couldn't take for
granted, some kind of repetitiveness inside the HTML code, inside dynamic
sites all the fixed code is SURELY always identical, because it's generated
by a computer. This helps us much when we create a wrapper, because we just
have to see the HTML code of one generic page (for instance, one forum
message) to find all the parts which never change.

In some cases, there will be some little differences between "particular" and
"general" pages, but still you'll just have to see an example for each one of
them and then be sure the same patterns will work in all other cases. To make
it clearer with an example, think about "unanswered forum messages" and
"forum messages which have a reply": their pages might differ slightly, so
you'll have to watch both of them, but once you understand which code remains
constant, you'll be able to build wrappers which work for both of these two
page classes.

As I told you before, thanks to the diffusion of many standards we can now
easily find patterns not only inside a website, but also from one site to
another one: think about news in RSS format, or all the blogs which even
provide an API to access them, or all the different forum softwares which are
written in different languages but have the same structure and often generate
almost identical HTML code.

When we've finished identifying patterns, we have to teach our bot how to
replicate them (in the case of browsing patterns) or to detect them (in the
case of wrappers). Inside the next two sections you'll see some thoughts,
still independent from programming languages, which might help you to design
your bots.

5.2 Website navigation with bots

For what concerns browsing, these are the operations a bot should be able to
easily do:

- given an URL, download the matching page (this task, as you'll see in the
  next chapter, is trivial, especially in Perl)

- link extraction, based on some particular conditions inside the link URL:
  for instance, the file name has to end with .gif, or all the files have to
  be inside the /textfiles/ directory, or have to be generated by the php
  script viewthread.php?f=1&thread=... This task can be easily done using
  regular expressions

- link extraction, based on some particular conditions inside the tagged
  text: for instance, we might want to choose only the links which are shown
  as "Next", or the ones which start with "Document". Regular expressions
  help much in this case too

- follow extracted links until a particular depth level: for instance, follow
  all the links you can find in a long list, and download all the images you
  can find in all of the visited websites

- follow extracted links forever, or until a particular condition: for
  instance, continue "clicking on Next" while this kind of link is present

- collect all the extracted links to use them later, from the last visited
  page or from all the downloaded ones: for instance, it should be possible
  to follow links for three levels of depth and then collect all the links
  to images, or follow all the "Next" ones inside a forum and, for EACH page
  we found, collect all the links to messages

If you ever decide to create a Perl bot, keep in mind that most of this work
has already been done by me (or by someone else I don't know, maybe before
me, probably better too). Inside the chapter devoted to examples, you'll find
the description (and the link to the source code) of a package, called
"common", which contains exactly the functions needed to complete these
tasks.
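As a rough illustration of the URL-based link extraction above, here is a
minimal sketch; the page fragment, the extract_links name and the .gif filter
are invented for the example and are much simpler than what the "common"
package does (in particular, a real bot should also absolutize the links, as
the rel2abs sub in the Flash example does):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Extract from $html all the href targets whose URL matches $filter.
# This is only a sketch: it filters on the URL, not on the tagged text.
sub extract_links {
    my ($html, $filter) = @_;
    my @links;
    while ($html =~ /<a[^>]+href\s*=\s*["']?([^"'\s>]+)/gis) {
        push @links, $1 if $1 =~ $filter;
    }
    return @links;
}

# hypothetical page fragment, used only to demonstrate the call
my $html = q{
    <a href="pics/one.gif">one</a>
    <a href="docs/readme.txt">readme</a>
    <A HREF='pics/two.GIF'>two</A>
};

print "$_\n" for extract_links($html, qr/\.gif$/i);
```

Run on the fragment above, this keeps only pics/one.gif and pics/two.GIF,
dropping the .txt link.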

5.3 Data extraction

For what concerns data extraction with bots, the technique I've found most
easy and powerful is the one which uses Regular Expressions. There are many
other ways to parse and extract information from HTML files, such as XPath or
text parsing libraries. In the next chapter I'll name these libs, but I won't
write about them in detail: if you want, you can find more information about
them inside "Spidering Hacks"... or around the Web.

Regexps allow you to "cut" the text, slicing it in chunks: with a RE like

   /<initial conditions>.*?<final conditions>/gsi

where

- the initial and final conditions are HTML code which surrounds what you
  are interested in

- .*? is a regular expression which matches (that is, which is satisfied
  by) the smallest text chunk between the two conditions

- the gsi "modifiers" allow you to make an iterated (g), case insensitive
  (i) search, considering all the text as a single line (s)

you can

- divide a text in chunks, for instance a thread in the messages which are
  part of it, or a long list of generic elements in the single elements
  (HEAD-TAIL wrapper)

- extract one or more text strings from a line (LEFT-RIGHT wrapper)
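To make the two wrapper types concrete, here is a tiny sketch; the forum-like
HTML fragment and its delimiters are invented for the example:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# an invented forum-like fragment, standing in for a downloaded page
my $page = '<b>Subject:</b> PowerBrowsing <br>'
         . '<div class="msg">first message</div>'
         . '<div class="msg">second message</div>';

# LEFT-RIGHT wrapper: one string between two fixed delimiters
my ($subject) = $page =~ /<b>Subject:<\/b>\s*(.*?)\s*<br>/si;
print "subject: $subject\n";

# HEAD-TAIL wrapper: slice the repeated elements out of the page
while ($page =~ /<div class="msg">(.*?)<\/div>/gs) {
    print "message: $1\n";
}
```

The first regexp extracts a single field ("PowerBrowsing"), the second one
iterates over the list of messages thanks to the g modifier.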

Since websites are subject to frequent changes (fortunately, the dynamically
generated ones aren't updated so much), creating your wrappers parametrically
is always a good choice: that is, they should be able to extract text
depending on some strings, which you can change from time to time, containing
the regular expressions you want to use. When the website changes, you'll
just have to change the regexp and your bot will work again (for an example,
see the one about cinemas inside Chapter 8).
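In its simplest form, a parametric wrapper just keeps the regexp in a string
(which could also be read from a configuration file); the pattern and the
page fragment below are made up for the example:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# the only site-dependent part: when the website layout changes,
# you just update this string and the bot works again
my $title_re = '<h2 class="film">(.*?)</h2>';

# invented fragment standing in for the downloaded HTML
my $page = '<h2 class="film">Plan 9 from Outer Space</h2>';

if ($page =~ /$title_re/s) {
    print "title: $1\n";
}
```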

=============================================================================
6. Perl PowerBrowsing Tools

Inside this chapter we'll speak about PowerBrowsing tools written in Perl.
Here, you'll learn how to create bots which connect to websites, downloading
what you want and extracting all the information you consider interesting,
and all of this while automatically managing identification with servers,
authentication, cookies and referrers.

The first section of this chapter explains why I chose Perl as a programming
language for bots. In the following ones, you'll have a glimpse of which Perl
libraries you can use to create Web-capable programs, then a little more
in-depth analysis of the main objects and methods you can use to create your
first bots.

6.1. Why Perl?

Why should you use Perl to create bots? Well, the expressive power of many
programming languages is almost the same now, but of course every one is
still different from the others because of some details, which can be more or
less important depending on what kind of application you want to create.

It's one of these details (and the fact I like this language much) which made
me choose Perl: its extremely powerful text manipulation functions let you
create in few seconds wrappers which, with other languages, might have been
harder to build.

To this we have to add the great availability of ready-made libraries, which
helps us much (as you'll see later) to evolve our simple data retrieval bots
into programs which _share_ information in the most common formats. Moreover,
some of these libraries are so easy to use that everybody will be able to
program with them.

Finally, we should consider two other important Perl features: first of all,
the portability, which lets you make experiments with bots independently from
the operating system you use; then, the presence of documents and tutorials
about this topic. I must say that "Spidering Hacks", in this case, has been
fundamental, giving me hints, new ideas and good examples whose source code I
used here: in particular, you can find most of the sources I used here, in
their original form, inside "Web Basics with LWP - Sample Recipes for Common
Tasks" by Sean M. Burke, which is available on the Web at

   http://perl.com/pub/a/2002/08/20/perlandlwp.html?page=1

Of course, these reasons I used to explain why I decided to use Perl should
not prevent you from using another language. Instead, I'd be glad to see
other bot implementations and check if they're more or less easy or powerful
than the ones I'll describe later.

6.2. Perl Packages

Perl has many libraries for Web access and for the management of the main
objects your bots will have to deal with. Even if you'll probably have to use
only a couple of these libraries frequently (LWP::Simple and LWP::UserAgent),
here's a small description of the other more commonly used ones:

LWP
   Also known as libwww-perl, it's a collection of modules for Web access

LWP::Simple
   It's the easiest package you can use to download pages from the Web:
   among the functions it exports, there are the classic "get" (which you
   have already seen in section 4.4), "getprint", which automatically prints
   what it downloads, and "getstore", which saves downloaded documents
   inside a file

LWP::UserAgent
   It's a more advanced library for Web access (we'll see it in more detail
   later)

HTTP::Request
HTTP::Response
   These are objects used to manage server requests and responses with a
   higher level of detail

HTTP::Message
HTTP::Headers
   These are classes which offer more methods to HTTP::Response

URI
   It's a class which exports methods to operate on Web addresses, for
   instance to make a relative URL absolute, or to extract the domain name
   or path elements from a link

URI::Escape
HTML::Entities
   These export methods for escaping and unescaping, respectively, URLs and
   text extracted from an HTML document, allowing your programs to manage
   non-standard characters such as spaces inside URLs or accented letters
   inside HTML code

HTML::TokeParser, HTML::TreeBuilder, WWW::Mechanize
   The first two classes are specialized in HTML parsing and offer you an
   alternative to regular expressions to extract information from Web
   pages; the last class allows you to automate the interaction with
   websites, filling forms, following links and so on. Together these
   classes might help you much, making your bots simpler and more stable,
   but (at least for now) they're not described in detail in this text.
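To give you a taste of the URI class named above, here is a minimal sketch
(it assumes the URI module, which is installed together with LWP; the example
address is of course a placeholder):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;   # installed together with LWP

my $u = URI->new('http://www.example.org/dir/page.html?lang=en');
print $u->host, "\n";   # the domain name
print $u->path, "\n";   # the path part

# make a relative link absolute against the page it came from
print URI->new_abs('../other.html', $u), "\n";
```

The last line resolves ../other.html against /dir/page.html, giving
http://www.example.org/other.html: this is the same job the rel2abs sub did
in the Flash lookahead extractor.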

6.3. LWP::Simple

LWP::Simple is the easiest package you might use to connect to the Web with
Perl. It exports the "get" command, which downloads the page at the specified
URL and returns its content, ready to be saved in a variable. The syntax is

   $content = get($URL);

Once in a variable, you can apply a regular expression to the content,
checking for some conditions: as an example, the next script will, in few
lines of code, tell you if there's a Radio Bandita stream online when you're
executing it.

#=============================================================================
use LWP::Simple;   # this is the package you want to use

my $url = 'http://radio.autistici.org/';   # this is the URL we'll check

# download $url and save its contents in $content
my $content = get($url);

# exit if an error occurs
die "I couldn't download $url" unless defined $content;

# now search inside $content
if ($content =~ m/bandita/i) {
    print "Radio Bandita is streaming right now!\n";
} else {
    print "No good music online.\n";
}
#=============================================================================

To have a slightly more useful (and surely more international) example, give
a look at this more advanced script:

#=============================================================================
use LWP::Simple;   # this is the package you want to use

my $search_term = $ARGV[0];   # get the search term from the command line
die "Specify a search string, please!\n" unless defined $search_term;

# this is the base URL of the search script, with the search term appended
my $url = "http://www.shoutcast.com/directory/?s=$search_term";

# download $url and save its contents in $content
my $content = get($url);

# exit if an error occurs
die "I couldn't download $url" unless defined $content;

# now search inside $content
if ($content =~ m/any SHOUTcast streams found/) {
    print "No shoutcast streams containing the term $search_term\n";
} else {
    print "Some shoutcast streams contain the term $search_term.\n";
    print "Here are the songs they're playing:\n";
    while ($content =~ /Playing:<\/font>\s(.*?)<\/font>/gs) {
        print $1 . "\n";
    }
}
#=============================================================================

In this script, we get a search string from the command line (as we did with
the Flash lookahead link extractor before) and then we append it to the
Shoutcast search engine URL: if the answer contains the string "any SHOUTcast
streams found", then no streams contain the searched word; otherwise, we can
just apply a simple left-right wrapper to the page source and extract all the
songs which are played by the streams which contain the searched term.

LWP::Simple gives you some more functions, such as "getprint" and "getstore",
whose usage is more specific and for this reason less frequent than simple
get. However, getprint can be particularly useful for one-liners in case you
don't have lynx installed on your system: in fact, the command

   perl -MLWP::Simple -e 'getprint "http://3564020356.org"'

works the same way as

   lynx -source http://3564020356.org

6.4. LWP::UserAgent

LWP::UserAgent is a more advanced and, unfortunately, more complex package
than LWP::Simple, but it offers many other functions which you might need if
you decide to build a complete, professional bot. Thanks to this package,
you'll be able to identify your User-Agent with the server, add the "Referer"
field to the request headers, POST data to a form, manage cookies and connect
through a proxy. And these are only some of the things you can do with it!

LWP::UserAgent 01 - Basics
#=============================================================================
#!/usr/bin/perl -w
use LWP 5.64;   # use LWP and check that the version is recent enough

my $url = 'http://radio.autistici.org/';

# create a new user agent
my $ua = LWP::UserAgent->new;

# GET the specified URL
my $response = $ua->get($url);

# if an error occurs, exit
# (NOTE the more advanced error management)
die "I can't download $url: ", $response->status_line
    unless $response->is_success;

# now, search inside page content
if ($response->content =~ m/bandita/i) {
    print "Radio Bandita is streaming right now!\n";
} else {
    print "No good music online.\n";
}
#=============================================================================

The script you've just seen is the easiest program example you can write with
LWP::UserAgent: it's an evolution of the previous script which was written
with LWP::Simple. It's not much more complex than the original one, but we
can already see a more advanced organization: first of all, a UserAgent
object which will manage all the communications with the server is created;
then, the UA is asked to GET the specified URL and to save the result inside
the $response variable. This variable doesn't contain only the HTML page
content anymore, but is a real object which can tell you if the operation
went fine or not (with the is_success method) and give you the page content
(with content) or the error code (status_line).

The "get" command itself, used here in its easiest version, allows you to
specify many other parameters: for instance you can add custom headers,
containing the identification string of your User-Agent, the file types you
accept, the character set and the language you prefer. To do this, you just
have to save these data inside an array and pass it as the second parameter
of the "get" method:

LWP::UserAgent 02 - Identify your bot
#=============================================================================
my @ns_headers = (
    'User-Agent'      => 'MaLaBot 1.0',
    'Accept'          => 'image/gif, image/x-xbitmap, image/jpeg,
                          image/pjpeg, image/png, */*',
    'Accept-Charset'  => 'iso-8859-1,*',
    'Accept-Language' => 'en-US',
);

$response = $browser->get($url, @ns_headers);

#=============================================================================

Some websites might refuse to accept data from you unless you come from a
particular URL: to check this, servers give a look at your referer, so you
might need to change it to a desired value. To do this, you have to work on
the "request" object: first you should create it, specifying which kind of
request (GET or POST) you want to send; then, you should add the referer with
the "referer" method. The following example shows you the steps you have to
follow when you POST data to a form (in this case, $POST_URL and $REFERER_URL
are variables which contain, respectively, the URL you want to send your
request to and the URL you've decided you are coming from).

LWP::UserAgent 03 - custom referers
#=============================================================================
# create request object, passing method
my $req = HTTP::Request->new(POST => "$POST_URL");

# this is used for post
$req->content_type('application/x-www-form-urlencoded');

# line to post
# this is an example string which will be sent to a search engine
$req->content("lang=it&descrizione=$letters&numero=&CheckExt=N");

# referer URL
$req->referer("$REFERER_URL");

my $res = $ua->request($req);
#=============================================================================

In the previous example you've seen how to POST form data. In particular, we
have used a string (the one specified with the "content" method of the
"Request" object) of concatenated elements. However, there's another way to
do the same thing: you can give the "post" method of the "UserAgent" object,
together with the destination URL, a hash (that is, an array containing
key/value pairs) with the parameters you want to send.

LWP::UserAgent 04 - post form data
#=============================================================================
my $ua  = LWP::UserAgent->new;
my $url = 'http://url.of.the.form/path/form.php';

my $response = $ua->post($url,
    ['param1' => $value1,
     'param2' => $value2,
     'param3' => $value3,
     'param4' => $value4,
    ]
);
#=============================================================================

Remember (I had to bonk my head on the monitor for a while before I found
this) that often, after a POST, you might be redirected to another page. To
have your UA return the page you're interested in, and not the one which
just contains the redirection to the page you want, you'll have to insert
the following line before your request:

   push @{ $ua->requests_redirectable }, 'POST';

This line tells the UserAgent that POSTs are requests for which the bot needs
to automatically follow redirects.

For your authentication inside some websites (or even for simple access to
other ones) your UserAgent should be able to handle cookies. Nothing easier:
you can use one of the following methods to activate UA cookie management.

LWP::UserAgent 05 - use cookies
#=============================================================================
use HTTP::Cookies;
$ua->cookie_jar(HTTP::Cookies->new(   # or HTTP::Cookies::Netscape
    'file'     => '/some/where/cookies.lwp',  # where to read/write cookies
    'autosave' => 1,                          # save it to disk when done
));

# -- or --

$kj = new HTTP::Cookies;
$kj->load("cookies.txt");
$ua->cookie_jar($kj);   # activate cookie jar for the user agent
#=============================================================================

Last, but not least (if you're working in an office and you're behind a proxy
these lines will be of primary importance for you), the UAs you create with
LWP::UserAgent can connect through proxies. You can use the proxy set as an
environment variable (env_proxy), so you don't have to reconfigure it, or you
can set up a new one, and even choose the protocols you want to use it for.

LWP::UserAgent 06 - connect through a proxy

#=============================================================================
$ua->env_proxy;

# -- or --

$ua->proxy(['http', 'ftp'], 'http://proxy.sn.no:8001/');
#=============================================================================

That's all for now: these examples aren't complete for sure, nor do they want
to answer all the questions you might ask yourself when you decide to build
your first bot. However, I think that they might make your life easier, at
least for the first experiments, and I'm sure that working on them for a
while (and, why not, maybe cooperating with someone else, if not with me) you
will get great results.

=============================================================================
7. Share your data

Now that you can create bots that go around retrieving data for you, the best
thing you could do is to find a way to share these data with other people.
Of course, this is a choice that everyone should make spontaneously, together
with the way you decide to do this sharing (from floppy disks to telepathy,
everything is allowed!)

In this section I'd like to give some hints to those among you who would like
to share the data harvested with their own bots, but still have no idea of
how they can do it.

7.1 Web and RSS

The minimal sharing example I can usually think about is the screen of my
computer in text mode, an idea I can replicate quite well by creating Web
pages whose contents are enclosed in <PRE> and </PRE> tags.

Without the need to reach this extreme, you can create light websites, maybe
dynamically generated with PHP, which allow you to share your data without
losing usability: a little search engine and some links to go from one part
of your data to another one will require a relatively small implementation
time, but they'll add much to the usability of the data you expose.

Alternatively, you can use the Perl package XML::RSS to publish your contents
in RSS: while this operation is absolutely trivial for you, it gives a great
service to the user, letting him choose the client he likes most to browse
the data.
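A minimal XML::RSS sketch, assuming the module is installed (channel and item
data are placeholders):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::RSS;

my $rss = XML::RSS->new(version => '1.0');

$rss->channel(
    title       => 'My bot harvest',        # placeholder
    link        => 'http://example.org/',   # placeholder
    description => 'Data gathered by a PowerBrowsing bot',
);

$rss->add_item(
    title => 'First harvested item',
    link  => 'http://example.org/item1',
);

print $rss->as_string;
```

Printing the feed to a file inside your Web root is enough to let every RSS
client read your data.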

7.2 Mail

When I talk about choosing a client, I can't help but think about my good,
old, comfortable and hyper-used mail client: if it could be worn out, mine
wouldn't even exist anymore now! It's become so comfortable to me that, even
if it isn't supported anymore, I still continue using it. And, being a
Windows app, it's one of the few reasons that still make me create dual boot
PCs.

So, if other users are as comfortable with their mail clients as I am with
mine, why don't we share our data with them through email? I remember some
services (now they're fewer and fewer, but there still are some) that allowed
you to see Web pages by mail; others periodically visited a website to check
for updates and sent notifications to subscribers; I have created a perl
script that, called by procmail whenever it gets a message with a particular
subject, sends an email with a list of all the films being played in my
city.

There are different ways to communicate data via email: one, perhaps more
complex if you don't have a server available, is to use procmail to filter
incoming messages and automatically answer, according to particular text
strings in the subject or in the message body; another one, instead, is to
create a virtual POP3 server like the one described in section 3.3. If you
want to give a look at some perl source code, you can download my old ANO
app from http://3564020356.org/tools/ano099beta.zip.
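As a sketch of the procmail route: a recipe like the following (the subject
string and the script path are placeholders, not the ones I actually use)
pipes every incoming message whose subject contains "cinemaz" to a bot, which
can then mail its answer back to the sender:

```
:0
* ^Subject:.*cinemaz
| /path/to/cinemaz.pl
```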

7.3 Net::Blogger

Among the various techniques you can use to publish your data, one of the
easiest and most versatile is using a blog. It's easy because you just have
to use a ready-made library, which uses the APIs of the most common blogs.
It's versatile because, when you publish data on a blog, you're making them
available not only in HTML, but also in RSS format; you have them saved in a
DB where you can make queries; also, you probably won't even need to own the
server which hosts them, because there are so many which give you a free blog
now; finally, you'll be able to use these satanic apps for a USEFUL purpose,
at last!

I will not dwell on explanations about this package: you can find all the
docs online and the source is self-explanatory. Keep in mind that the part
related to blog publishing is all in the last twenty rows, while the previous
ones deal with variable declarations and scraper code (which, in this case,
is the same I used inside the "Things I've learned from B-Movies" example).
If you change the first lines of the script, inserting your blog's API's URL,
its name, your login and password, you'll have in few seconds your first
working Blogger bot.

#=============================================================================
#!/usr/bin/perl

use Net::Blogger;   # this is used to post articles
use common;         # this is used for LWP related functions (getpage, exturl)

my $debug = 1;

my $PROXY = 'http://site.of.your.blog/blog/nucleus/xmlrpc/server.php';
my $BLOG  = 'myblog';
my $LOGIN = 'login';
my $PASS  = 'password';
my $TITLE = "Things I've learned from B movies";
my $CATEG = "bot";

#
# this is the scraper code

my $BADMOVIES_URL = 'http://www.badmovies.org/movies/';
my $LINK_FORMAT   = '/movies/.*?/index.html';
my @quotes;

my $idx_content = getpage($BADMOVIES_URL);
my @movies      = exturl($idx_content, $LINK_FORMAT, '', $BADMOVIES_URL);

my $movies_size = @movies;
my $randurl     = $movies[rand($movies_size - 1)];
my $mov_content = getpage($randurl);

if ($mov_content =~ /<title>\s*Review for (.*?)\s*\n/si) {
    $title = $1; chomp $title;
}

if ($mov_content =~ /learned\.gif>(.*?)<\/font><br>/si) {
    my $learned = $1;
    while ($learned =~ /10>\s*(.*?)\s*\n/gsi) {
        push @quotes, $1;
    }
}

my $quote_size = @quotes;
my $quote      = $quotes[rand($quote_size - 1)];

$DATA  = qq|$quote<br>|;
$DATA .= qq|(<a href="$randurl">$title</a>)|;

#
# this is the blogger code

$blogger = Net::Blogger->new(debug => $debug);
$blogger->Proxy($PROXY);
$blogger->Username($LOGIN);
$blogger->Password($PASS);

# get blog id and assign it to the blogger
my $blogid = $blogger->GetBlogId(blogname => $BLOG);
$blogger->BlogId($blogid);

# create post text
my $txt = "<title>$TITLE</title>";
$txt   .= "<category>$CATEG</category>";
$txt   .= "$DATA";

# send and publish the new post
my $id = $blogger->newPost(postbody => \$txt, publish => 1)
    || die "Error: " . $blogger->LastError();
#=============================================================================

7.4 TWO

TWO (The Working Offline forum reader) isn't a very active project at the
moment, but I worked a lot on it last year and now it enters with full merit
among the advanced PowerBrowsing techniques. Thanks to a very modular
structure, based on plugins, it allows you to download the contents of web
forums that use different technologies (at the moment four different forum
types are supported, for which google returns me some millions of hits), to
save them inside a unified database and to read messages with a Web (HTML and
PHP) interface or to share them via Web Services (with ready-made clients,
even if they are minimal, in Java, Perl and C).

   +--------+   +----------+                            +--> HTML + PHP
   | FORUM1 |<->|    UA    |--+                         |
   +--------+   | wrapper1 |  |    ________             |    +--> Java
                +----------+  |   |        |            |    |
   +--------+   +----------+  +-->|        |            +--> Web Service --> Perl
   | FORUM2 |<->|    UA    |----->|   DB   |------------+    |
   +--------+   | wrapper2 |  +-->|        |                 +--> C
                +----------+  |   |________|                 |
   +--------+   +----------+  |                              V
   | FORUM3 |<->|    UA    |--+                           REGISTRY
   +--------+   | wrapper3 | ...
                +----------+

One of the main advantages of TWO, beyond the possibility to share downloaded
information with other people, is that it merges data coming from websites
with different technologies inside a single database: in this way you can
find, with a single search, messages written in different forums. Moreover,
the disk space required by the information extracted from the forums is a
very small part of the downloaded data, which in turn is just a very small
part of what a normal browser would have downloaded.

How much disk space can TWO save, exactly? The following are the results of
some tests (you can find the complete version inside TWO's documentation):

   Forum data size (KB) ......  92741
   TWO's data size (KB) ......   7892
                                =====
   Saved space (KB) ..........  84849
   Saved space (perc) ........    91%   <= !!!

The saved space, of course, is not only an advantage for you, but also for
all the people that will download that data from your computer. For instance,
if people started to share forums, you could just compress that 8MB database
to obtain a file the size of a floppy disk, so everyone in few minutes (or
seconds!) could download it and enrich their DB with lots of new information.
Using a more advanced technology, if many TWO Web Services shared different
databases through a registry, every Web user could make distributed searches
on different forums at the same time.

Of course, TWO can still be improved much and the errors that should be
corrected are, probably, still many. The source code is provided "as is" and
you'll probably need some time to understand how it works and how it can be
improved. However, if you are interested, you can download TWO's source code
and documentation from http://two.sf.net. Let me know if you can make
something good out of it ;)

=============================================================================
8. Examples

In this section you'll have the chance to see and try some examples. To save
space, I've decided not to publish their source code here, but to insert a
link from which you can download them. If you can't connect there, you can
send me a mail and I'll answer you with an alternative URL.

Common lib
http://3564020356.org/cgi-bin/perlcode.pl?file=common.pm

common.pm is a package I created when I was developing TWO. It contains all
the main functions I used to control a web bot's behavior:

- getpage is very similar to the LWP::Simple "get" command, but it uses
  LWP::UserAgent instead, to obtain some more advantages: it can manage
  cookies, a proxy, User-Agent identification and multiple retries before
  it aborts a download.

- exturl collects, inside an array, all the links it can find inside a Web
  page and which satisfy one or more regular expressions inside the URL or
  the tagged text: this allows you to follow links such as "all the files
  whose name ends in .txt" or "all the links whose text matches 'Next'".

- walkpages is a recursive function which uses exturl to follow a list of
  links and collect a list of others. It can work in different ways: it can
  follow different links depending on the depth, and it can collect only
  the ones it finds in the last page, or all the ones it finds during its
  travel.

- walkpages_loop is the "looped" version of walkpages: that is, it follows
  the same (or the same list of) links indefinitely (or for a specified
  depth) until it finds results, collecting all the matching links it
  finds.

Cinemaz
http://3564020356.org/cgi-bin/perlcode.pl?file=cinema.pl

NOTE: it uses common.pm

Cinemaz is a program I created for personal use, to automatically gather
data from http://www.monzacinema.it. This website shows all the movies you
can find this week in different cinemas, but if you want to choose a
particular movie, know where and when you have to go to see it, and find
a phone number to book a seat, well, you have to click far too many times!
So, this bot connects to the main page and downloads all the pages which
describe the cinemas, extracting their name, the phone number, the movie
name and the timetable. All the information is shown in a good old plain
text file, which has everything you need... and only that.

With some little adaptations I managed to use the same script with procmail
and now, wherever I am, I just need to send my bot a mail with "cinemaz" as
subject to get an answer containing the very same text file :)
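To give an idea of what the common.pm helpers described earlier look like, here is a minimal sketch of getpage and exturl. This is a hypothetical reimplementation based on their description, not the real package's code (which you can download from the URL above), and the link-matching regex is a deliberate simplification:

```perl
#!/usr/bin/perl
# Hypothetical sketch, NOT the real common.pm: getpage() fetches a page
# with LWP::UserAgent (cookies, custom User-Agent, retries), exturl()
# collects the links of a page matching a URL or link-text regex.
use strict;
use warnings;
use LWP::UserAgent;

sub getpage {
    my ($url, $retries) = @_;
    $retries ||= 3;
    my $ua = LWP::UserAgent->new(agent => 'Mozilla/5.0');
    $ua->cookie_jar({});                # keep cookies between requests
    # $ua->proxy('http', 'http://...') # a proxy could be set here
    for (1 .. $retries) {
        my $res = $ua->get($url);
        return $res->content if $res->is_success;
    }
    return undef;                       # every retry failed
}

sub exturl {
    my ($html, $url_re, $text_re) = @_;
    my @links;
    # naive <a> matcher: capture the href and the tagged text
    while ($html =~ m{<a\s[^>]*href=["']?([^"'\s>]+)["']?[^>]*>(.*?)</a>}sgi) {
        my ($href, $text) = ($1, $2);
        push @links, $href
            if (defined $url_re  && $href =~ $url_re)
            || (defined $text_re && $text =~ $text_re);
    }
    return @links;
}
```

With these, `exturl($page, qr/\.txt$/, undef)` would return the "all the files whose name ends in .txt" links and `exturl($page, undef, qr/Next/)` the links whose text matches "Next", the two examples mentioned above.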

Things I've Learned from B-Movies
http://3564020356.org/cgi-bin/perlcode.pl?file=badmovies.pl
NOTE: it uses common.pm

http://www.badmovies.com is a very funny website, containing many B-movie
reviews. One of the funniest things, IMO, is that inside every movie page
there's a section, called "Things I've learned from B-Movies", with a list
of fortunes about (silly) things you can learn from that movie.

When you run the script it downloads the movie list with their links, it
chooses and follows a random one, then it extracts all the quotes from the
"Things I've learned from B-Movies" section and finally, depending on how
you ran it, it shows all of them or just a random one, like the "fortune"
application.
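The last two steps (extract the quotes, show a random one) can be sketched in a few lines. The list markup below is invented for the example; the real badmovies.com pages may well be structured differently:

```perl
#!/usr/bin/perl
# Sketch of the final steps of a badmovies.pl-like script: grab every
# quote of a section, then print a random one, fortune-style.
use strict;
use warnings;

# Hypothetical markup: stand-in for the downloaded section of the page.
my $section = '<li>Quote one.</li><li>Quote two.</li><li>Quote three.</li>';

my @quotes = $section =~ m{<li>(.*?)</li>}sgi;  # every quote in the section
print $quotes[ int rand @quotes ], "\n";        # one random "fortune"
```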

Malacomix
http://3564020356.org/cgi-bin/perlcode.pl?file=comics.pl

Malacomix connects to http://www.comics.com and lets you see, every single
day, your favorite comic strips. Since it uses the common syntax used by
the website to find all the different strips, you just have to write inside
the URL the names of the strips you want to see and it will generate an
HTML page containing all the links to the matching images.
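Generating such a page takes very little code once the website's common URL syntax is known. The scheme below is a made-up placeholder, not comics.com's real one:

```perl
#!/usr/bin/perl
# Sketch of the Malacomix idea: since every strip is reachable through
# the same URL scheme, an HTML index can be built from a list of names.
# The URL pattern below is a placeholder, not the real comics.com one.
use strict;
use warnings;

sub strip_index {
    my @names = @_;
    my $html = "<html><body>\n";
    # one link per strip name, all following the same URL scheme
    $html .= qq{<a href="http://www.comics.com/$_/">$_</a><br>\n} for @names;
    return $html . "</body></html>\n";
}

# strip names come from the command line, with a default list
print strip_index(@ARGV ? @ARGV : qw(dilbert peanuts));
```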

Happy3 URL Extractor
http://3564020356.org/cgi-bin/perlcode.pl?file=happy3.pl

This script has been created with a particular purpose: to let everyone see
Happy Tree Friends episodes _the way they like_ (and not in a fixed-size
popup window) or download them to their hard disk. The script is very, very
easy, but it followed a more advanced "Flash reversing" work. Maybe one
day, when you are not so tired because you've read this loooong text, I'll
explain this one to you too :)
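The general idea can be sketched as pulling the movie URLs out of a page's embed tags so they can be opened full-size or saved to disk. The markup here is invented; the real script came out of digging into the site's Flash code, which is a different job:

```perl
#!/usr/bin/perl
# Sketch of the URL-extraction idea: collect the src of every <embed>
# tag in a page. The sample markup is invented for the example.
use strict;
use warnings;

sub embed_urls {
    my ($html) = @_;
    # naive matcher: capture the src attribute of each <embed> tag
    return $html =~ m{<embed\s[^>]*src=["']?([^"'\s>]+)}sgi;
}

my $page = '<embed src="http://example.com/episode1.swf" width="640">';
print "$_\n" for embed_urls($page);
```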
