Stats Collection

Scrapy provides a convenient facility for collecting stats in the form of key/values, where values are often counters. The facility is called the Stats Collector, and can be accessed through the stats attribute of the Crawler API, as illustrated by the examples in the Common Stats Collector uses section below.

Stats collection is always enabled, so the Stats Collector can be accessed at any time through the stats attribute of the Crawler API.
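The same attribute works from any component that receives the crawler. A minimal sketch, assuming a hypothetical extension class (the name MyStatsExtension is made up for illustration):

class MyStatsExtension:
    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        # the Stats Collector is exposed as crawler.stats
        return cls(crawler.stats)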

Using stats inside a Spider

Customize the QuotesSpider from the official tutorial so that it records its own stats.
A Spider holds the Crawler as an attribute, so the Stats Collector is reachable as self.crawler.stats.

Values are read and written through the Stats Collector API; the commonly used calls are sketched below.
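A minimal sketch of those calls, assuming they run inside a spider callback where self.crawler is available (the key names here are only examples):

    stats = self.crawler.stats
    stats.set_value('last_url', response.url)        # store a value
    stats.inc_value('crawled_pages')                 # increment a counter
    stats.max_value('max_status', response.status)   # keep the largest value seen
    value = stats.get_value('crawled_pages')         # read a single value
    all_stats = stats.get_stats()                    # read everything as a dict

The customized spider below only needs inc_value, to count pages and scraped quotes.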

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        #print(self.crawler.stats.get_stats())
        self.crawler.stats.inc_value('crawled_pages')
        for quote in response.css('div.quote'):
            self.crawler.stats.inc_value('crawled_items')
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

When the run finishes, the crawled_pages and crawled_items counters set inside the Spider are shown together with the standard stats.

2020-05-18 XX:XX:XX [scrapy.core.engine] INFO: Closing spider (finished)
2020-05-18 XX:XX:XX [scrapy.extensions.feedexport] INFO: Stored json feed (100 items) in: quotes.json
2020-05-18 XX:XX:XX [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'crawled_items': 100,
'crawled_pages': 10,
'downloader/request_bytes': 2895,
'downloader/request_count': 11,
'downloader/request_method_count/GET': 11,
'downloader/response_bytes': 24911,
'downloader/response_count': 11,
'downloader/response_status_count/200': 10,
'downloader/response_status_count/404': 1,
'elapsed_time_seconds': 7.113748,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 5, 18, xx, xx, xx, XXXXXX),
'item_scraped_count': 100,
'log_count/DEBUG': 111,
'log_count/INFO': 11,
'memusage/max': 55717888,
'memusage/startup': 55717888,
'request_depth_max': 9,
'response_received_count': 11,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/404': 1,
'scheduler/dequeued': 10,
'scheduler/dequeued/memory': 10,
'scheduler/enqueued': 10,
'scheduler/enqueued/memory': 10,
'start_time': datetime.datetime(2020, 5, 18, xx, xx, xx, XXXXXX)}
2020-05-18 XX:XX:XX [scrapy.core.engine] INFO: Spider closed (finished)
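If you also want to look at the final numbers from inside the spider itself, rather than only in the log dump above, one option is the spider's closed shortcut, which Scrapy calls when the spider finishes (a minimal sketch; this method is not part of the example above and would be added to QuotesSpider):

    def closed(self, reason):
        # runs once at the end of the crawl; dump everything the Stats Collector holds
        print(self.crawler.stats.get_stats())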