Signals

Scrapy uses signals extensively to notify when certain events occur. You can catch some of those signals in your Scrapy project (using an extension, for example) to perform additional tasks or extend Scrapy to add functionality not provided out of the box.

Scrapy lets you use signals to extend processing when certain events occur.
The following sample code from the reference is an example that runs the spider_closed() method when the signals.spider_closed signal is caught.

from scrapy import signals
from scrapy import Spider


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(DmozSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        spider.logger.info('Spider closed: %s', spider.name)

    def parse(self, response):
        pass
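
As the quoted documentation notes, signals can also be caught from an extension rather than a spider. Below is a minimal sketch of that variant; the class name is illustrative, and such a class would additionally need to be enabled in the project's EXTENSIONS setting to take effect.

from scrapy import signals


# Hypothetical extension that logs when a spider closes.
# Connect the handler in from_crawler, just like the spider-based example above.
class SpiderClosedLogger:
    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider):
        spider.logger.info('Spider closed (from extension): %s', spider.name)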

Reporting statistics when the crawl finishes

This example customizes the QuotesSpider from the official tutorial to access the crawl statistics when the crawl finishes.
When the signals.spider_closed signal is caught, you can implement processing such as reporting the statistics.

import scrapy
from scrapy import signals

class QuotesSpider(scrapy.Spider):
name = "quotes"
start_urls = [
'http://quotes.toscrape.com/page/1/',
]

@classmethod
def from_crawler(cls, crawler, *args, **kwargs):
spider = super(QuotesSpider, cls).from_crawler(crawler, *args, **kwargs)
crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
return spider

def spider_closed(self, spider):
spider.logger.info('Spider closed!!!!!!!!!: %s', spider.name)
spider.logger.info(spider.crawler.stats.get_stats())

def parse(self, response):
for quote in response.css('div.quote'):
yield {
'text': quote.css('span.text::text').get(),
'author': quote.css('small.author::text').get(),
'tags': quote.css('div.tags a.tag::text').getall(),
}

next_page = response.css('li.next a::attr(href)').get()
if next_page is not None:
next_page = response.urljoin(next_page)
yield scrapy.Request(next_page, callback=self.parse)
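
Instead of dumping the whole stats dict, individual values can be read with get_value(), which is part of Scrapy's StatsCollector API. A small variation of spider_closed() along those lines (the stat keys shown are just examples taken from the log output):

    def spider_closed(self, spider):
        stats = spider.crawler.stats
        # Read individual entries from the stats collector instead of the full dict.
        spider.logger.info('Items scraped: %s', stats.get_value('item_scraped_count'))
        spider.logger.info('Finish reason: %s', stats.get_value('finish_reason'))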

The output produced by the signal handling is shown below.

2020-05-18 XX:XX:XX [scrapy.core.engine] INFO: Closing spider (finished)
2020-05-18 XX:XX:XX [scrapy.extensions.feedexport] INFO: Stored json feed (100 items) in: quotes.json
2020-05-18 XX:XX:XX [quotes] INFO: Spider closed!!!!!!!!!: quotes
2020-05-18 XX:XX:XX [quotes] INFO: {'log_count/INFO': 13, 'start_time': datetime.datetime(2020, 5, 18, xx, xx, xx, xxxxxx), 'memusage/startup': 55676928, 'memusage/max': 55676928, 'scheduler/enqueued/memory': 10, 'scheduler/enqueued': 10, 'scheduler/dequeued/memory': 10, 'scheduler/dequeued': 10, 'downloader/request_count': 11, 'downloader/request_method_count/GET': 11, 'downloader/request_bytes': 2895, 'robotstxt/request_count': 1, 'downloader/response_count': 11, 'downloader/response_status_count/404': 1, 'downloader/response_bytes': 24911, 'log_count/DEBUG': 111, 'response_received_count': 11, 'robotstxt/response_count': 1, 'robotstxt/response_status_count/404': 1, 'downloader/response_status_count/200': 10, 'item_scraped_count': 100, 'request_depth_max': 9, 'elapsed_time_seconds': 8.22286, 'finish_time': datetime.datetime(2020, 5, 18, xx, xx, xx, xxxxxx), 'finish_reason': 'finished'}
2020-05-18 XX:XX:XX [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2895,
'downloader/request_count': 11,
'downloader/request_method_count/GET': 11,
'downloader/response_bytes': 24911,
'downloader/response_count': 11,
'downloader/response_status_count/200': 10,
'downloader/response_status_count/404': 1,