HTML Code

    <tr>
    <td align="center" valign="top" width="150" bgcolor="FF9933" style="word-wrap: break-word">
    <a href="http://profile.myspace.com/index.cfm?fuseaction=user.viewprofile&friendid=321865836">
    iLUVKutsANdBURses</a><br><br>
    <a href="http://profile.myspace.com/index.cfm?fuseaction=user.viewprofile&friendid=321865836">
    <img src="http://a181.ac-images.myspacecdn.com/images01/113/s_53fde2989fb50a6c872cfc2a2164a57c.jpg"
    border="0"/></a><br><br /><br><br>
    </td>
    <td bgcolor="F9D6B4" align="left" valign="top" width="260" style="word-wrap: break-word"
    class="columnsWidening"><span class="blacktext10">Jul 12 2008 10:35 PM</span>
    <br><br> U R THE BEST SINGER IN THE WORLD
    </td>
    </tr>

Rendered Content

Figure 3. Typical Web Scraping

3. System

Figure 4 illustrates how RID technology may be integrated into a typical web server environment. To optimize the process of generating the dynamic HTTP request-result, the web server 'pre-generates' the set of redundant code and simply 'injects' randomly selected code into the base code at appropriate points (where the data needs to be hidden from web-scrapers).

4. Related Work

Blocking IP addresses [3] stops both legitimate and illegitimate extraction from an IP address (or set of IP addresses) and is thus not suitable when access is process-focused. The random insertion of CAPTCHAs [4] to counter scraping is also common practice; it differs from our approach in that our approach is completely transparent to legitimate users.

Paid content sites (e.g. age appropriate entertainment) have a legacy of anti-crawling. These sites control access by means of user login. Other sites that rely on "small changes" or "hidden random strings" do not keep their mechanisms "completely hidden" from end users.

5. Conclusion
The proposed system provides a way to disable web scraping by obfuscating the code rendered to the Web client, such that although the rendered webpage (as seen on the screen by the end-user) is unchanged, the code behind the webpage is changed (dynamically) upon every fetch request. This code-poisoning technique ensures that no automated agent can reliably collect data from the website, thus rendering the extraction agent ineffective. In the end, the data consumer has no choice but to adopt the content provider's Web Services to gain access to the data.
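
The mechanism summarized here (redundant code pre-generated once, then randomly selected fragments injected into the base code at chosen points on every fetch) might be sketched in Python as follows. This is an illustrative sketch only, not the authors' implementation: the fragment pool, the `<!--RID-->` marker, and the names `inject` and `visible_text` are assumptions introduced for the example, and it assumes the injected fragments are visually inert in the target layout.

```python
import random
from html.parser import HTMLParser
from typing import List

# Hypothetical pool of pre-generated redundant fragments (empty cells and
# an empty anchor, in the spirit of the injected code in the RID listing).
REDUNDANT_POOL = [
    "<td></td>",
    '<td align="center" valign="top" width="150"><a href="blank.jpg"></a></td>',
    '<td style="display:none"></td>',
]

# Hypothetical marker placed in the base code where data must be hidden.
INJECTION_POINT = "<!--RID-->"

BASE_ROW = (
    "<tr>" + INJECTION_POINT
    + '<td bgcolor="F9D6B4">Jul 12 2008 10:35 PM</td>' + INJECTION_POINT
    + "<td>U R THE BEST SINGER IN THE WORLD</td></tr>"
)


def inject(base: str, rng: random.Random, max_fragments: int = 3) -> str:
    """Replace each injection point with 0..max_fragments random fragments."""
    parts = base.split(INJECTION_POINT)
    out = [parts[0]]
    for part in parts[1:]:
        count = rng.randint(0, max_fragments)
        out.append("".join(rng.choice(REDUNDANT_POOL) for _ in range(count)))
        out.append(part)
    return "".join(out)


class _TextCollector(HTMLParser):
    """Collects non-empty text nodes, ignoring all markup."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []

    def handle_data(self, data: str) -> None:
        if data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> List[str]:
    """Return the text a reader would see, in document order."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return collector.chunks


if __name__ == "__main__":
    rng = random.Random()
    for _ in range(2):  # two fetches of the same page
        served = inject(BASE_ROW, rng)
        print(served)
        print(visible_text(served))
```

Each call to `inject` stands in for one fetch request: the text content stays the same while the cell positions a scraper might rely on shift from response to response. A scraper that extracts by text content rather than by position would not be affected by this sketch.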
[Diagram labels: Content Code Generator; RID Code Injection; Injection Database; HTTP WebServer; HTTP Request & Response; Rendered Content; End User; Web-Scraper; Content Harvester (marked X)]

Random Injection of HTML Code

    <tr><tr><td><td></td></td></tr>
    <td align="center" valign="top" width="150" bgcolor="FF9933" style="word-wrap: break-word">
    …..<a href="http://profile.myspace.com/index.cfm?fuseaction=user.viewprofile&friendid=321865836">
    <img src="http://a181.ac-images.myspacecdn.com/images01/113/s_53fde2989fb50a6c872cfc2a2164a57c.jpg"
    border="0"/> </a><br><br /><br><br>
    </td><td></td><td align="center" valign="top" width="150"><a href="blank.jpg"></a></td>
    <td bgcolor="F9D6B4" align="left" valign="top" width="260" style="word-wrap: break-word"
    class="columnsWidening"><span class="blacktext10">Jul 12 2008 10:35 PM</span>
    <br><br> U R THE BEST SINGER IN THE WORLD
    </td>

Figure 4. Random Injection based Deactivation (RID) of Web-Scrapers

As there is no change to the display on the screen, the end user experience remains the same, thus achieving the technology's objective of providing completely transparent and non-intrusive deactivation of web-scraping applications and services.

References

[1] Web Scraping, http://en.wikipedia.org/wiki/Web_scraping
[2] Alfredo Alba, Varun Bhagwan, Tyrone Grandison: Accessing the deep web: when good ideas go bad. OOPSLA Companion 2008: 815-818.
[3] IP Blocking, http://en.wikipedia.org/wiki/Wikipedia:Blocking_IP_addresses
[4] Luis von Ahn, Manuel Blum, Nicholas J. Hopper and John Langford. "CAPTCHA: Using Hard AI Problems for Security". Proceedings of Eurocrypt 2003, Warsaw, Poland, May 4–8, 2003. LNCS Vol. 2656/2003.