Stats Collection

Scrapy provides a convenient facility for collecting stats in the form of key/values, where values are often counters. The facility is called the Stats Collector, and can be accessed through the stats attribute of the Crawler API, as illustrated by the examples in the Common Stats Collector uses section below.

Stats collection is always enabled, so the Stats Collector can be accessed at any time through the stats attribute of the Crawler API.
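The same attribute works from any component that receives the crawler. A minimal sketch, assuming a hypothetical extension class (the name MyStatsExtension is made up for illustration):

class MyStatsExtension:
    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        # the Stats Collector is exposed as crawler.stats
        return cls(crawler.stats)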

Using stats inside a Spider

Customize the QuotesSpider from the official tutorial so that it records its own stats.
A Spider holds the Crawler as an attribute, so the Stats Collector is reachable as self.crawler.stats.

Values are read and written through the Stats Collector API; the commonly used calls are sketched below.
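A minimal sketch of those calls, assuming they run inside a spider callback where self.crawler is available (the key names here are only examples):

    stats = self.crawler.stats
    stats.set_value('last_url', response.url)        # store a value
    stats.inc_value('crawled_pages')                 # increment a counter
    stats.max_value('max_status', response.status)   # keep the largest value seen
    value = stats.get_value('crawled_pages')         # read a single value
    all_stats = stats.get_stats()                    # read everything as a dict

The customized spider below only needs inc_value, to count pages and scraped quotes.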

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        #print(self.crawler.stats.get_stats())
        self.crawler.stats.inc_value('crawled_pages')
        for quote in response.css('div.quote'):
            self.crawler.stats.inc_value('crawled_items')
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

When the run finishes, the crawled_pages and crawled_items counters set inside the Spider are shown together with the standard stats.

2020-05-18 XX:XX:XX [scrapy.core.engine] INFO: Closing spider (finished)
2020-05-18 XX:XX:XX [scrapy.extensions.feedexport] INFO: Stored json feed (100 items) in: quotes.json
2020-05-18 XX:XX:XX [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'crawled_items': 100,
'crawled_pages': 10,
'downloader/request_bytes': 2895,
'downloader/request_count': 11,
'downloader/request_method_count/GET': 11,
'downloader/response_bytes': 24911,
'downloader/response_count': 11,
'downloader/response_status_count/200': 10,
'downloader/response_status_count/404': 1,
'elapsed_time_seconds': 7.113748,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 5, 18, xx, xx, xx, XXXXXX),
'item_scraped_count': 100,
'log_count/DEBUG': 111,
'log_count/INFO': 11,
'memusage/max': 55717888,
'memusage/startup': 55717888,
'request_depth_max': 9,
'response_received_count': 11,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/404': 1,
'scheduler/dequeued': 10,
'scheduler/dequeued/memory': 10,
'scheduler/enqueued': 10,
'scheduler/enqueued/memory': 10,
'start_time': datetime.datetime(2020, 5, 18, xx, xx, xx, XXXXXX)}
2020-05-18 XX:XX:XX [scrapy.core.engine] INFO: Spider closed (finished)
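If you also want to look at the final numbers from inside the spider itself, rather than only in the log dump above, one option is the spider's closed shortcut, which Scrapy calls when the spider finishes (a minimal sketch; this method is not part of the example above and would be added to QuotesSpider):

    def closed(self, reason):
        # runs once at the end of the crawl; dump everything the Stats Collector holds
        print(self.crawler.stats.get_stats())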