
How to download a file with Scrapy?

1. Way

Save the pdf in the spider callback:


def parse_listing(self, response):
    # ... extract pdf urls
    for url in pdf_urls:
        yield Request(url, callback=self.save_pdf)

def save_pdf(self, response):
    path = self.get_path(response.url)
    with open(path, "wb") as f:
        f.write(response.body)
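The snippet above assumes a self.get_path helper that is never defined. A minimal standalone sketch of such a helper (the downloads directory name and the index.pdf fallback are assumptions, not part of the original answer) could be:

```python
import os
from urllib.parse import urlparse

def get_path(url, download_dir="downloads"):
    # Hypothetical helper: map a URL to a local file path under download_dir,
    # using the last path segment of the URL as the file name.
    name = os.path.basename(urlparse(url).path) or "index.pdf"
    os.makedirs(download_dir, exist_ok=True)
    return os.path.join(download_dir, name)
```

In a real spider you would likely also sanitize the name and handle collisions; this only shows the basic URL-to-path mapping.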
2. Way

Do it in a pipeline:
# in the spider
def parse_pdf(self, response):
    i = MyItem()
    i['body'] = response.body
    i['url'] = response.url
    # you can add more metadata to the item
    return i

# in your pipeline
def process_item(self, item, spider):
    path = self.get_path(item['url'])
    with open(path, "wb") as f:
        f.write(item['body'])
    # remove body and add path as reference
    del item['body']
    item['path'] = path
    # let item be processed by other pipelines, i.e. db store
    return item
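Taken together, the pipeline logic can be sketched as a self-contained function. Here the item is modeled as a plain dict, and the path derivation is a placeholder for the undefined self.get_path (the store_dir default is an assumption):

```python
import os
from urllib.parse import urlparse

def process_item(item, store_dir="downloads"):
    # Derive a local path from the item's URL (stand-in for self.get_path).
    name = os.path.basename(urlparse(item['url']).path) or "file.bin"
    os.makedirs(store_dir, exist_ok=True)
    path = os.path.join(store_dir, name)
    # Write the raw bytes carried in the item.
    with open(path, "wb") as f:
        f.write(item['body'])
    # Remove body and keep the path as a lightweight reference,
    # so later pipelines (e.g. a db store) don't carry the payload.
    del item['body']
    item['path'] = path
    return item
```

This mirrors the pipeline above: the heavy body leaves the item as soon as it is on disk.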
3. Way

Use the FilesPipeline (https://groups.google.com/forum/#!msg/scrapy-users/kzGHFjXywuY/O6PIhoT3thsJ)


1) Download https://raw.github.com/scrapy/scrapy/master/scrapy/contrib/pipeline/files.py
and save it somewhere in your Scrapy project,
let's say at the root of your project (but that's not the best location):
yourproject/files.py

2) Then enable this pipeline by adding this to your settings.py:


ITEM_PIPELINES = [
    'yourproject.files.FilesPipeline',
]
FILES_STORE = '/path/to/yourproject/downloads'

FILES_STORE needs to point to a location where Scrapy can write (create it beforehand).
3) Add 2 special fields to your item definition:
file_urls = Field()
files = Field()

4) In your spider, when you have a URL for a file to download,
add it to your Item instance before returning it:
...
myitem = YourProjectItem()
...
myitem["file_urls"] = ["http://www.example.com/somefileiwant.csv"]
yield myitem

5) Run your spider and you should see files in the FILES_STORE folder.
Here's an example that downloads a few files from the IETF website.
The Scrapy project is called filedownload.
items.py looks like this:
from scrapy.item import Item, Field

class FiledownloadItem(Item):
    file_urls = Field()
    files = Field()

This is the code for the spider:


from scrapy.spider import BaseSpider
from filedownload.items import FiledownloadItem

class IetfSpider(BaseSpider):
    name = "ietf"
    allowed_domains = ["ietf.org"]
    start_urls = (
        'http://www.ietf.org/',
    )

    def parse(self, response):
        yield FiledownloadItem(
            file_urls=[
                'http://www.ietf.org/images/ietflogotrans.gif',
                'http://www.ietf.org/rfc/rfc2616.txt',
                'http://www.rfceditor.org/rfc/rfc2616.ps',
                'http://www.rfceditor.org/rfc/rfc2616.pdf',
                'http://tools.ietf.org/html/rfc2616.html',
            ]
        )

When you run the spider, at the end, you should see in the console something like this:
2013-09-21 18:30:42+0200 [ietf] DEBUG: Scraped from <200 http://www.ietf.org/>
{'file_urls': ['http://www.ietf.org/images/ietflogotrans.gif',
               'http://www.ietf.org/rfc/rfc2616.txt',
               'http://www.rfceditor.org/rfc/rfc2616.ps',
               'http://www.rfceditor.org/rfc/rfc2616.pdf',
               'http://tools.ietf.org/html/rfc2616.html'],
 'files': [{'checksum': 'e4b6ca0dd271ce887e70a1a2a5d681df',
            'path': 'full/4f7f3e96b2dda337913105cd751a2d05d7e64b64.gif',
            'url': 'http://www.ietf.org/images/ietflogotrans.gif'},
           {'checksum': '9fa63f5083e4d2112d2e71b008e387e8',
            'path': 'full/454ea89fbeaf00219fbcae49960d8bd1016994b0.txt',
            'url': 'http://www.ietf.org/rfc/rfc2616.txt'},
           {'checksum': '5f0dc88aced3b0678d702fb26454e851',
            'path': 'full/f76736e9f1f22d7d5563208d97d13e7cc7a3a633.ps',
            'url': 'http://www.rfceditor.org/rfc/rfc2616.ps'},
           {'checksum': '2d555310626966c3521cda04ae2fe76f',
            'path': 'full/6ff52709da9514feb13211b6eb050458f353b49a.pdf',
            'url': 'http://www.rfceditor.org/rfc/rfc2616.pdf'},
           {'checksum': '735820b4f0f4df7048b288ba36612295',
            'path': 'full/7192dd9a00a8567bf3dc4c21ababdcec6c69ce7f.html',
            'url': 'http://tools.ietf.org/html/rfc2616.html'}]}
2013-09-21 18:30:42+0200 [ietf] INFO: Closing spider (finished)

which tells you what files were downloaded, and where they were stored.

Note: Scrapy version 0.22.2 already contains files.py; use it directly:
from scrapy.contrib.pipeline.files import FilesPipeline, FSFilesStore
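With the bundled pipeline, settings.py only needs the built-in class path instead of a local copy of files.py (the FILES_STORE path is a placeholder):

```python
# settings.py -- use the pipeline shipped with Scrapy 0.22.2
ITEM_PIPELINES = ['scrapy.contrib.pipeline.files.FilesPipeline']
FILES_STORE = '/path/to/yourproject/downloads'
```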
Final question: how to save a file to MongoDB directly?
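One possible answer, sketched as an assumption rather than a tested recipe: reuse the "2. Way" item (raw bytes in item['body']) and write them to MongoDB's GridFS with pymongo. The class name MongoFilesPipeline, the URI, and the scrapydb database name are all hypothetical:

```python
# Hypothetical pipeline: store downloaded file bodies in MongoDB GridFS.
# Assumes a spider like "2. Way" above that puts raw bytes in item['body'].
class MongoFilesPipeline(object):
    def __init__(self, mongo_uri='mongodb://localhost:27017', db_name='scrapydb'):
        self.mongo_uri = mongo_uri
        self.db_name = db_name

    def open_spider(self, spider):
        # Imported lazily so the module loads even without pymongo installed.
        import pymongo
        import gridfs
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.fs = gridfs.GridFS(self.client[self.db_name])

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # GridFS stores the raw bytes; keep the ObjectId as a reference.
        file_id = self.fs.put(item['body'], filename=item['url'])
        del item['body']
        item['file_id'] = str(file_id)
        return item
```

As with the file-on-disk pipeline, the body is dropped from the item once stored, so later pipelines only see the GridFS id.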
Reference:
https://stackoverflow.com/questions/7123387/scrapy-define-a-pipleine-to-save-files
https://groups.google.com/forum/#!msg/scrapy-users/kzGHFjXywuY/O6PIhoT3thsJ
