
1/2/2018 Using Scrapy in Jupyter notebook | JJ's World


Using Scrapy in Jupyter notebook


Wed 02 August 2017

This notebook uses the Scrapy (https://scrapy.org) library to scrape data from a website. Following the basic example, we create a QuotesSpider and pass it to a
CrawlerProcess to retrieve quotes from http://quotes.toscrape.com.

In this notebook two pipelines are defined, both writing results to a JSON file. The first option is to create a separate class that defines the pipeline and explicitly
contains the functions that write each found item to a file. It gives more flexibility when dealing with unusual data formats, or if you want to set up a custom way of
writing items to file. The pipeline is set in the custom_settings parameter ITEM_PIPELINES inside the QuotesSpider class. However, I simply want to write the list of
items found by the spider to a JSON file, so it is easier to choose the second option, where only FEED_FORMAT has to be set to json and the output file has to be
defined in FEED_URI inside the custom settings of the spider. No additional classes or definitions are needed, which makes FEED_FORMAT/FEED_URI a
convenient option.

Once the quotes are retrieved, the JSON file is created on disk and can be loaded into a Pandas dataframe. This dataframe can then be analyzed, modified and
used for further processing. This notebook simply loads the JSON file into a dataframe and writes it back out as a pickle.

In [1]: # Settings for notebook


from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# Show Python version
import platform
platform.python_version()

Out[1]: '3.6.1'

## Import Scrapy
In [2]: try:
    import scrapy
except ImportError:
    # Scrapy is missing: install it from within the notebook, then retry
    !pip install scrapy
    import scrapy
from scrapy.crawler import CrawlerProcess

## Setup a pipeline

This class creates a simple pipeline that writes all found items to a JSON file, where each line contains one JSON element.

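In [3]: import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file
        self.file = open('quoteresult.jl', 'w')

    def close_spider(self, spider):
        # Called once when the spider finishes: close the output file
        self.file.close()

    def process_item(self, item, spider):
        # Called for every scraped item: write it as one JSON object per line
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item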
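
The pipeline methods are ordinary Python, so they can be exercised by hand before any crawl is started. The following is a minimal sketch rather than part of the original notebook: it repeats the class above, writes to a temporary directory instead of the notebook folder, and passes `None` for the `spider` argument, which this pipeline never uses. The sample item is made up for illustration.

```python
import json
import os
import tempfile


class JsonWriterPipeline(object):
    """Same logic as the pipeline above, repeated so the sketch is self-contained."""

    def open_spider(self, spider):
        self.file = open('quoteresult.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item


# Hypothetical item, shaped like the dictionaries the spider yields.
sample_item = {'text': 'An example quote.', 'author': 'Jane Doe', 'tags': ['example']}

original_dir = os.getcwd()
with tempfile.TemporaryDirectory() as workdir:
    os.chdir(workdir)  # keep the test file out of the notebook folder
    try:
        pipeline = JsonWriterPipeline()
        pipeline.open_spider(spider=None)
        returned = pipeline.process_item(sample_item, spider=None)
        pipeline.close_spider(spider=None)
        with open('quoteresult.jl') as handle:
            written_lines = handle.read().splitlines()
    finally:
        os.chdir(original_dir)  # leave the directory before it is deleted

print(written_lines)
```

Returning the item from `process_item` matters: Scrapy hands the returned value on to the next pipeline or feed exporter, which is why both output files in this notebook can be filled from the same crawl.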

## Define the spider

The QuotesSpider class defines from which URLs to start crawling and which values to retrieve. I set the logging level of the crawler to warning, otherwise the notebook
is flooded with DEBUG messages about the retrieved data.

https://www.jitsejan.nl/using-scrapy-in-jupyter-notebook.html 1/4

In [4]: import logging

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]
    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        'ITEM_PIPELINES': {'__main__.JsonWriterPipeline': 1},  # Used for pipeline 1
        'FEED_FORMAT': 'json',                                 # Used for pipeline 2
        'FEED_URI': 'quoteresult.json'                         # Used for pipeline 2
    }

    def parse(self, response):
        # Each quote sits in its own <div class="quote"> block
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('span small::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
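
`FEED_FORMAT` and `FEED_URI` match the Scrapy 1.4.0 release used in this notebook. Newer releases (2.1 and later, as far as the Scrapy release notes indicate) replace the pair with a single `FEEDS` dictionary and treat the old settings as deprecated. On such an install, the feed export part of `custom_settings` would probably look like the sketch below. It is an assumption to check against the documentation of the installed version, not something the original notebook ran.

```python
import logging

# Assumption: Scrapy 2.1 or newer, where FEEDS maps each output path to its options.
custom_settings = {
    'LOG_LEVEL': logging.WARNING,
    'FEEDS': {
        'quoteresult.json': {'format': 'json'},
    },
}
```

The `ITEM_PIPELINES` entry for the custom pipeline is left out here only to keep the sketch short; it can sit next to `FEEDS` in the same dictionary.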

## Start the crawler


In [5]: process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(QuotesSpider)
process.start()

2017-08-02 15:22:02 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)


2017-08-02 15:22:02 [scrapy.utils.log] INFO: Overridden settings: {'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}

Out[5]:

## Check the files

Verify that the files have been created on disk. As we can observe, both files are created and contain data. The .jl file has line-separated JSON elements, while the .json
file has one big JSON array containing all the quotes.

In [6]: ll quoteresult.*

-rw-rw-r-- 1 jitsejan 5551 Aug 2 15:22 quoteresult.jl


-rw-rw-r-- 1 jitsejan 5573 Aug 2 15:22 quoteresult.json

In [7]: !tail -n 2 quoteresult.jl

{"text": "\u201cA woman is like a tea bag; you never know how strong it is until it's in hot water.\u201d", "author": "Eleanor Roosevelt", "tags": ["misattributed-eleanor-roosevelt"]}
{"text": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin", "tags": ["humor", "obvious", "simile"]}

In [8]: !tail -n 2 quoteresult.json

{"text": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin", "tags": ["humor", "obvious", "simile"]}
]
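
The difference between the two layouts can be reproduced with the standard library alone. This is a small sketch with two made-up items standing in for scraped quotes; the variable names are illustrative and not taken from the notebook.

```python
import json

# Two made-up items standing in for scraped quotes.
items = [
    {'text': 'First example quote.', 'author': 'Jane Doe', 'tags': ['a', 'b']},
    {'text': 'Second example quote.', 'author': 'John Doe', 'tags': []},
]

# JSON Lines (.jl): one complete JSON object per line.
jl_text = ''.join(json.dumps(item) + '\n' for item in items)

# JSON (.json): a single array holding every object.
json_text = json.dumps(items)

# Reading them back: parse line by line versus parse the whole document at once.
from_jl = [json.loads(line) for line in jl_text.splitlines()]
from_json = json.loads(json_text)

print(from_jl == from_json)  # prints True
```

Both layouts carry the same data. The line-based one can be appended to and read one item at a time, while the array has to be complete before it is valid JSON.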

## Create dataframes

Pandas can now be used to create dataframes and save them as pickles. The .json file can be loaded directly into a frame, whereas for the .jl file we need to
specify that the JSON objects are separated per line.


In [9]: import pandas as pd


dfjson = pd.read_json('quoteresult.json')
dfjson

Out[9]:

|    | author | tags | text |
|----|--------|------|------|
| 0 | Marilyn Monroe | [friends, heartbreak, inspirational, life, lov... | “This life is what you make it. No matter what... |
| 1 | J.K. Rowling | [courage, friends] | “It takes a great deal of bravery to stand up ... |
| 2 | Albert Einstein | [simplicity, understand] | “If you can't explain it to a six year old, yo... |
| 3 | Bob Marley | [love] | “You may not be her first, her last, or her on... |
| 4 | Dr. Seuss | [fantasy] | “I like nonsense, it wakes up the brain cells.... |
| 5 | Douglas Adams | [life, navigation] | “I may not have gone where I intended to go, b... |
| 6 | Elie Wiesel | [activism, apathy, hate, indifference, inspira... | “The opposite of love is not hate, it's indiff... |
| 7 | Friedrich Nietzsche | [friendship, lack-of-friendship, lack-of-love,... | “It is not a lack of love, but a lack of frien... |
| 8 | Mark Twain | [books, contentment, friends, friendship, life] | “Good friends, good books, and a sleepy consci... |
| 9 | Allen Saunders | [fate, life, misattributed-john-lennon, planni... | “Life is what happens to us while we are makin... |
| 10 | Albert Einstein | [change, deep-thoughts, thinking, world] | “The world as we have created it is a process ... |
| 11 | J.K. Rowling | [abilities, choices] | “It is our choices, Harry, that show what we t... |
| 12 | Albert Einstein | [inspirational, life, live, miracle, miracles] | “There are only two ways to live your life. On... |
| 13 | Jane Austen | [aliteracy, books, classic, humor] | “The person, be it gentleman or lady, who has ... |
| 14 | Marilyn Monroe | [be-yourself, inspirational] | “Imperfection is beauty, madness is genius and... |
| 15 | Albert Einstein | [adulthood, success, value] | “Try not to become a man of success. Rather be... |
| 16 | André Gide | [life, love] | “It is better to be hated for what you are tha... |
| 17 | Thomas A. Edison | [edison, failure, inspirational, paraphrased] | “I have not failed. I've just found 10,000 way... |
| 18 | Eleanor Roosevelt | [misattributed-eleanor-roosevelt] | “A woman is like a tea bag; you never know how... |
| 19 | Steve Martin | [humor, obvious, simile] | “A day without sunshine is like, you know, nig... |

In [10]: dfjl = pd.read_json('quoteresult.jl', lines=True)


dfjl

Out[10]:

|    | author | tags | text |
|----|--------|------|------|
| 0 | Marilyn Monroe | [friends, heartbreak, inspirational, life, lov... | “This life is what you make it. No matter what... |
| 1 | J.K. Rowling | [courage, friends] | “It takes a great deal of bravery to stand up ... |
| 2 | Albert Einstein | [simplicity, understand] | “If you can't explain it to a six year old, yo... |
| 3 | Bob Marley | [love] | “You may not be her first, her last, or her on... |
| 4 | Dr. Seuss | [fantasy] | “I like nonsense, it wakes up the brain cells.... |
| 5 | Douglas Adams | [life, navigation] | “I may not have gone where I intended to go, b... |
| 6 | Elie Wiesel | [activism, apathy, hate, indifference, inspira... | “The opposite of love is not hate, it's indiff... |
| 7 | Friedrich Nietzsche | [friendship, lack-of-friendship, lack-of-love,... | “It is not a lack of love, but a lack of frien... |
| 8 | Mark Twain | [books, contentment, friends, friendship, life] | “Good friends, good books, and a sleepy consci... |
| 9 | Allen Saunders | [fate, life, misattributed-john-lennon, planni... | “Life is what happens to us while we are makin... |
| 10 | Albert Einstein | [change, deep-thoughts, thinking, world] | “The world as we have created it is a process ... |
| 11 | J.K. Rowling | [abilities, choices] | “It is our choices, Harry, that show what we t... |
| 12 | Albert Einstein | [inspirational, life, live, miracle, miracles] | “There are only two ways to live your life. On... |
| 13 | Jane Austen | [aliteracy, books, classic, humor] | “The person, be it gentleman or lady, who has ... |
| 14 | Marilyn Monroe | [be-yourself, inspirational] | “Imperfection is beauty, madness is genius and... |
| 15 | Albert Einstein | [adulthood, success, value] | “Try not to become a man of success. Rather be... |
| 16 | André Gide | [life, love] | “It is better to be hated for what you are tha... |
| 17 | Thomas A. Edison | [edison, failure, inspirational, paraphrased] | “I have not failed. I've just found 10,000 way... |
| 18 | Eleanor Roosevelt | [misattributed-eleanor-roosevelt] | “A woman is like a tea bag; you never know how... |
| 19 | Steve Martin | [humor, obvious, simile] | “A day without sunshine is like, you know, nig... |

In [11]: dfjson.to_pickle('quotejson.pickle')
dfjl.to_pickle('quotejl.pickle')


In [12]: ll *pickle

-rw-rw-r-- 1 jitsejan 5676 Aug 2 15:22 quotejl.pickle


-rw-rw-r-- 1 jitsejan 5676 Aug 2 15:22 quotejson.pickle


© JJ's World (.) | Powered by Pelican (http://getpelican.com/) | Hosted on SSDNodes (https://www.ssdnodes.com) | 2008 - 2017
