
Web Scraping with Python

Miguel Miranda de Mattos


@mmmattos - mmmattos.net
Porto Alegre, Brazil.
2012
Web Scraping with Python
Tools:
BeautifulSoup
Mechanize
BeautifulSoup
An HTML/XML parser for Python that can turn even invalid
markup into a parse tree. It provides simple, idiomatic ways
of navigating, searching, and modifying the parse tree. It
commonly saves programmers hours or days of work.
In Summary:
Navigate the "soup" of HTML/XML tags programmatically
Access tags' properties and values
Search for tags and their attributes.
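BeautifulSoup
Example: navigating and accessing attributes. A minimal sketch in
the BeautifulSoup 3 / Python 2 idiom used on these slides; the HTML
snippet and URL are invented for illustration.
from BeautifulSoup import BeautifulSoup
doc = '<html><body><a href="http://example.com" title="Home">Go</a></body></html>'
soup = BeautifulSoup(doc)
print soup.html.body.a     # navigate by nesting tag names
print soup.a['href']       # read an attribute value, dictionary-style
print soup.a.get('title')  # same, returning None if the attribute is absent
print soup.a.string        # the text inside the tag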
BeautifulSoup
Example:
from BeautifulSoup import BeautifulSoup
doc = "<html><h1>Heading</h1><p>Text"
soup = BeautifulSoup(doc)
print soup.prettify()
# <html>
# <h1>
# Heading
# </h1>
# <p>
# Text
# </p>
# </html>

BeautifulSoup
Searching / Looking for things
'find', 'findAll', 'findAllNext', 'findAllPrevious', 'findChild',
'findChildren', 'findNext', 'findNextSibling', 'findNextSiblings',
'findParent', 'findParents', 'findPrevious', 'findPreviousSibling',
'findPreviousSiblings'
findAll
findAll(self, name=None, attrs={}, recursive=True,
text=None, limit=None, **kwargs)
Extracts a list of Tag objects that match the given
criteria. You can specify the name of the Tag and any
attributes you want the Tag to have.

BeautifulSoup
Example:
>>> from BeautifulSoup import BeautifulSoup
>>> doc = "<table><tr><td>one</td><td>two</td></tr></table>"
>>> docSoup = BeautifulSoup(doc)

>>> print docSoup.findAll('tr')
[<tr><td>one</td><td>two</td></tr>]
>>> print docSoup.findAll('td')
[<td>one</td>, <td>two</td>]
BeautifulSoup
findAll (contd.):
>>> for t in docSoup.findAll('td'):
...     print t
<td>one</td>
<td>two</td>
>>> for t in docSoup.findAll('td'):
...     print t.getText()
one
two
BeautifulSoup
findAll using attributes to qualify:
>>> soup.findAll('div', attrs={'class': 'Menus'})
[<div class="Menus">musicMenu</div>, <div class="Menus">videoMenu</div>]
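findAll also accepts tag attributes as plain keyword arguments, plus
text and limit filters. A minimal sketch under the same BeautifulSoup 3 /
Python 2 assumptions (the HTML is invented for illustration):
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup('<div id="main">one</div><div>two</div><div>three</div>')
>>> soup.findAll('div', id='main')   # attribute as keyword argument
[<div id="main">one</div>]
>>> soup.findAll('div', limit=2)     # stop after two matches
[<div id="main">one</div>, <div>two</div>]
>>> soup.findAll(text='three')       # match text nodes instead of tags
[u'three']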
For more options:
dir(BeautifulSoup)
help(yourSoup.<command>)
Prefer BeautifulSoup over raw regexp patterns. Instead of:
patFinderTitle = re.compile(r'<a[^>]*\stitle="(.*?)"')
re.findall(patFinderTitle, html)
write:
soup = BeautifulSoup(html)
for tag in soup.findAll('a', title=True):
    print tag['title']
Mechanize
Stateful programmatic web browsing in Python, after
Andy Lester's Perl module WWW::Mechanize.
mechanize.Browser and mechanize.UserAgentBase implement the
interface of urllib2.OpenerDirector, so:
any URL can be opened, not just http:
mechanize.UserAgentBase offers easy dynamic configuration of
user-agent features like protocol, cookie, redirection and
robots.txt handling, without having to make a new
OpenerDirector each time, e.g. by calling build_opener().
Easy HTML form filling (see the sketch after this list).
Convenient link parsing and following.
Browser history (.back() and .reload() methods).
The Referer HTTP header is added properly (optional).
Automatic observance of robots.txt.
Automatic handling of HTTP-Equiv and Refresh.
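Mechanize
Example: form filling and user-agent configuration. A minimal sketch
in the Python 2 style of these slides; the URL, form name and field
name are invented for illustration.
import mechanize
br = mechanize.Browser()
br.set_handle_robots(False)   # opt out of automatic robots.txt observance
br.addheaders = [('User-agent', 'Mozilla/5.0')]   # set a custom User-Agent
br.open("http://www.example.com/search")
br.select_form(name="searchform")   # pick the form by its name attribute
br["q"] = "python"                  # fill the text field named "q"
response = br.submit()              # submit the form, like clicking its button
print response.geturl()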
Mechanize
Navigation commands:
open(url)
follow_link(link)
back()
submit()
reload()
Examples
br = mechanize.Browser()
br.open("http://python.org/")
gothtml = br.response().read()
for link in br.links(url_regex="python.org"):
    print link
    br.follow_link(link)  # takes EITHER Link instance OR keyword args
    br.back()
Mechanize
Example:
import re
import mechanize
br = mechanize.Browser()
br.open("http://www.example.com/")
# follow second link with element text matching
# regular expression
response1 = br.follow_link(text_regex=r"cheese\s*shop")
assert br.viewing_html()
print br.title()
print response1.geturl()
print response1.info() # headers
print response1.read() # body
Mechanize
Example: Combining Mechanize and BeautifulSoup
import mechanize
from BeautifulSoup import BeautifulSoup
url = "http://www.hp.com"
br = mechanize.Browser()
br.open(url)
assert br.viewing_html()
html = br.response().read()
result_soup = BeautifulSoup(html)
found_divs = result_soup.findAll('div')
print "Found " + str(len(found_divs))
for d in found_divs:
    print d
Mechanize
Example: Combining Mechanize and BeautifulSoup
import mechanize
from BeautifulSoup import BeautifulSoup
url = "http://www.hp.com"
br = mechanize.Browser()
br.open(url)
assert br.viewing_html()
html = br.response().read()
result_soup = BeautifulSoup(html)
found_divs = result_soup.findAll('div')
print "Found " + str(len(found_divs))
for d in found_divs:
    if d.has_key('class'):
        print d['class']
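Mechanize
Example: navigating with Mechanize, extracting with BeautifulSoup.
A minimal sketch; the URL and the link-text pattern are invented
for illustration.
import mechanize
from BeautifulSoup import BeautifulSoup
br = mechanize.Browser()
br.open("http://www.example.com/")
br.follow_link(text_regex=r"[Pp]roducts")    # navigate to a subpage
soup = BeautifulSoup(br.response().read())   # parse the landing page
for tag in soup.findAll('a', title=True):    # tags that have a title attribute
    print tag['title']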
