How to Stop a Scrapy Crawl

A Scrapy crawl keeps running to the end even when failures occur, recording them in the crawl stats as it goes. If you want to abort processing on events such as a failed page fetch or a failure inside an item pipeline, you need to raise the appropriate exception.
The available exceptions are listed in the Built-in Exceptions reference.
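
Both exceptions used in this article live in scrapy.exceptions. As a quick orientation, a minimal sketch (the two classes are part of the Scrapy API; the comments just summarize how they are used in the rest of this article):

from scrapy.exceptions import CloseSpider, DropItem

# CloseSpider: raise from a spider callback to stop the whole crawl;
#              its message ends up in the finish_reason stat.
# DropItem:    raise from an item pipeline to discard a single item;
#              the crawl itself keeps running.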

Exceptions in a Spider

A typical example is given in the Exceptions reference: if response.body contains a message indicating that the bandwidth limit has been exceeded, the spider raises CloseSpider to stop crawling.

from scrapy.exceptions import CloseSpider

def parse_page(self, response):
    # response.body is bytes, so compare against a bytes literal
    if b'Bandwidth exceeded' in response.body:
        raise CloseSpider('bandwidth_exceeded')

Exceptions in an Item Pipeline

A typical example is given in the Item Pipeline reference: if an item has no price field, the pipeline raises DropItem to discard that item (the crawl itself continues).

from scrapy.exceptions import DropItem

class PricePipeline:

    vat_factor = 1.15

    def process_item(self, item, spider):
        if item.get('price'):
            if item.get('price_excludes_vat'):
                item['price'] = item['price'] * self.vat_factor
            return item
        else:
            raise DropItem("Missing price in %s" % item)
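
Note that the pipeline only runs if it is enabled in settings.py. A minimal sketch, assuming a project module named myproject with the class in myproject/pipelines.py (both names are placeholders):

# settings.py (sketch)
ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,  # lower values run earlier in the pipeline chain
}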

Behavior When an Exception Is Raised

Here, the QuotesSpider from the official tutorial is modified to raise an exception that stops the spider.
The parse() callback is simply aborted at that point.

import scrapy
from scrapy.exceptions import CloseSpider

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        raise CloseSpider("Force Close!!!!!!")
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

When the spider is stopped this way, the message passed to CloseSpider is recorded as finish_reason in the crawl stats.

2020-05-18 XX:XX:XX [scrapy.core.engine] INFO: Spider opened
2020-05-18 XX:XX:XX [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-05-18 XX:XX:XX [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-05-18 XX:XX:XX [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2020-05-18 XX:XX:XX [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2020-05-18 XX:XX:XX [scrapy.core.engine] INFO: Closing spider (Force Close!!!!!!)
2020-05-18 XX:XX:XX [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 455,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 2719,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 1,
'elapsed_time_seconds': 1.21889,
'finish_reason': 'Force Close!!!!!!',
'finish_time': datetime.datetime(2020, 5, 18, xx, xx, xx, XXXXXX),
'log_count/DEBUG': 2,
'log_count/INFO': 10,
'memusage/max': 55631872,
'memusage/startup': 55631872,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/404': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2020, 5, 18, xx, xx, xx, XXXXXX)}
2020-05-18 XX:XX:XX [scrapy.core.engine] INFO: Spider closed (Force Close!!!!!!)
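
If you want to react to the close reason in code instead of only reading it from the stats dump, the spider's closed() hook receives the same string. A minimal sketch (closed() and its reason argument are standard Scrapy; the log line is just illustration):

import scrapy
from scrapy.exceptions import CloseSpider

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        raise CloseSpider("Force Close!!!!!!")

    def closed(self, reason):
        # reason is "finished" for a normal run, or the message
        # passed to CloseSpider ("Force Close!!!!!!") when aborted
        self.logger.info('Spider closed: %s', reason)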

Using errbacks to catch exceptions in request processing

Exceptions raised while a Request is being processed can be handled with an errback.
The Scrapy documentation shows an example that traps typical network-related exceptions. The sample only logs the failures, but you can instead raise one of the exceptions described above to abort the crawl (see the sketch after the example).

import scrapy

from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

class ErrbackSpider(scrapy.Spider):
    name = "errback_example"
    start_urls = [
        "http://www.httpbin.org/",            # HTTP 200 expected
        "http://www.httpbin.org/status/404",  # Not found error
        "http://www.httpbin.org/status/500",  # server issue
        "http://www.httpbin.org:12345/",      # non-responding host, timeout expected
        "http://www.httphttpbinbin.org/",     # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                 errback=self.errback_httpbin,
                                 dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        # do something useful here...

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        # in case you want to do something special for some errors,
        # you may need the failure's type:

        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)
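
As noted above, the errback does not have to stop at logging. A minimal sketch of the same errback aborting the crawl when DNS resolution fails (only the raise is new; whether to stop on this particular error is a policy choice, not something Scrapy mandates):

# requires: from scrapy.exceptions import CloseSpider
def errback_httpbin(self, failure):
    self.logger.error(repr(failure))

    if failure.check(DNSLookupError):
        # abort the whole crawl instead of just recording the failure
        raise CloseSpider('dns_lookup_failed')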

Handling Exceptions in a Splash Lua Script

Here, error() is used inside the Lua script run by a SplashRequest to force an error. An error inside the Lua script is returned from Splash to Scrapy as a response with HTTP status code 400.

Scrapy traps that error in the errback set on the SplashRequest and raises CloseSpider to stop the spider.

# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest
from scrapy_splash_tutorial.items import QuoteItem
from scrapy.spidermiddlewares.httperror import HttpError
from scrapy.exceptions import CloseSpider

_login_script = """
function main(splash, args)
    assert(splash:go(args.url))
    assert(splash:wait(0.5))

    -- force an error
    error("Force Splash Error")

    return {
        url = splash:url(),
        html = splash:html(),
    }
end
"""

class QuotesjsSpider(scrapy.Spider):
    name = 'quotesjs'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/js/']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                callback=self.parse,
                errback=self.errback_httpbin,
                endpoint='execute',
                cache_args=['lua_source'],
                args={'timeout': 60, 'wait': 5, 'lua_source': _login_script},
            )

    def parse(self, response):
        for q in response.css(".container .quote"):
            quote = QuoteItem()
            quote["author"] = q.css(".author::text").extract_first()
            quote["quote"] = q.css(".text::text").extract_first()
            yield quote

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        # in case you want to do something special for some errors,
        # you may need the failure's type:

        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)
            raise CloseSpider("Force Close!!!!!!")
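
For reference, the spider above assumes scrapy-splash is wired into the project settings. A minimal settings.py sketch along the lines of the scrapy-splash README (the host name splash matches the docker-compose style setup visible in the log below; adjust SPLASH_URL to your environment):

# settings.py (sketch)
SPLASH_URL = 'http://splash:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'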

On the Splash side this shows up as a 400 "Bad request to Splash" error; the errback then raises CloseSpider and the spider terminates.

2020-05-18 XX:XX:XX [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2020-05-18 XX:XX:XX [scrapy.core.engine] DEBUG: Crawled (404) <GET http://splash:8050/robots.txt> (referer: None)
2020-05-18 XX:XX:XX [scrapy_splash.middleware] WARNING: Bad request to Splash: {'error': 400, 'type': 'ScriptError', 'description': 'Error happened while executing Lua script', 'info': {'source': '[string "..."]', 'line_number': 7, 'error': 'Force Splash Error', 'type': 'LUA_ERROR', 'message': 'Lua error: [string "..."]:7: Force Splash Error'}}
2020-05-18 XX:XX:XX [scrapy.core.engine] DEBUG: Crawled (400) <GET http://quotes.toscrape.com/js/ via http://splash:8050/execute> (referer: None)
2020-05-18 XX:XX:XX [quotesjs] ERROR: <twisted.python.failure.Failure scrapy.spidermiddlewares.httperror.HttpError: Ignoring non-200 response>
2020-05-18 XX:XX:XX [quotesjs] ERROR: HttpError on http://quotes.toscrape.com/js/
2020-05-18 XX:XX:XX [scrapy.core.engine] INFO: Closing spider (Force Close!!!!!!)
2020-05-18 XX:XX:XX [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1282,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 2,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 1134,
'downloader/response_count': 3,
'downloader/response_status_count/400': 1,
'downloader/response_status_count/404': 2,
'elapsed_time_seconds': 2.439215,
'finish_reason': 'Force Close!!!!!!',
'finish_time': datetime.datetime(2020, 5, 18, xx, xx, xx, XXXXXX),
'log_count/DEBUG': 3,
'log_count/ERROR': 2,
'log_count/INFO': 10,
'log_count/WARNING': 2,
'memusage/max': 56270848,
'memusage/startup': 56270848,
'response_received_count': 3,
'robotstxt/request_count': 2,
'robotstxt/response_count': 2,
'robotstxt/response_status_count/404': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'splash/execute/request_count': 1,
'splash/execute/response_count/400': 1,
 'start_time': datetime.datetime(2020, 5, 18, xx, xx, xx, XXXXXX)}
2020-05-18 XX:XX:XX [scrapy.core.engine] INFO: Spider closed (Force Close!!!!!!)