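Downloading and saving files with a Scrapy crawler

Downloading and saving files with Scrapy

In Scrapy, crawling multiple pages can be written concisely with CrawlSpider, and downloading files with FilesPipeline.
By default, FilesPipeline names each file with the SHA-1 hash of its URL, so it needs customizing to keep the original names.
The source code is concise and readable, so subclassing and customizing it is easy.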

CrawlSpider

Spiders

In summary, the key points are:

Use LinkExtractor in rules to extract the pages to crawl
Extract items from each extracted page in the callback

FilesPipeline

Downloading and processing files and images

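In summary, the key points are:

Set the download directory with FILES_STORE in settings.py
Enable FilesPipeline with ITEM_PIPELINES in settings.py
Add a file_urls field to the generated item and set it to the URLs of the files to download
Add a files field to the generated item to hold the download results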

Using the Files Pipeline

The typical workflow, when using the FilesPipeline goes like this:

In a Spider, you scrape an item and put the URLs of the desired file into a file_urls field.

The item is returned from the spider and goes to the item pipeline.

When the item reaches the FilesPipeline, the URLs in the file_urls field are scheduled for download using the standard Scrapy scheduler and downloader (which means the scheduler and downloader middlewares are reused), but with a higher priority, processing them before other pages are scraped. The item remains “locked” at that particular pipeline stage until the files have finish downloading (or fail for some reason).

When the files are downloaded, another field (files) will be populated with the results. This field will contain a list of dicts with information about the downloaded files, such as the downloaded path, the original scraped url (taken from the file_urls field), and the file checksum. The files in the list of the files field will retain the same order of the original file_urls field. If some file failed downloading, an error will be logged and the file won’t be present in the files field.

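When the spider scrapes a page and sets the target URLs in file_urls, the downloads are scheduled through the standard scheduler and downloader, but with a higher priority, so they are processed before other pages are scraped. The download results are recorded in files.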
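As a rough illustration of what ends up in files, the sketch below compares file_urls against files to find URLs that failed to download. The item is a plain dict with made-up URLs, path, and checksum, and failed_downloads is a hypothetical helper; only the url, path, and checksum keys come from the documentation quoted above.

```python
def failed_downloads(item):
    """Return the URLs from file_urls that have no entry in files."""
    downloaded = {entry["url"] for entry in item.get("files", [])}
    return [url for url in item.get("file_urls", []) if url not in downloaded]


# Hypothetical item as it might look after passing through FilesPipeline.
item = {
    "file_urls": [
        "https://example.com/a.pdf",
        "https://example.com/b.pdf",
    ],
    "files": [
        {
            "url": "https://example.com/a.pdf",
            "path": "full/made-up-name.pdf",     # made-up path
            "checksum": "not-a-real-checksum",   # made-up checksum
        },
    ],
}

print(failed_downloads(item))
```

Because a failed file is simply missing from files, a comparison by URL like this is one way to notice it without reading the log.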

Enabling your Media Pipeline

To enable your media pipeline you must first add it to your project ITEM_PIPELINES setting.

For Images Pipeline, use:

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}

For Files Pipeline, use:

ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}

Enable it by specifying 'scrapy.pipelines.files.FilesPipeline': 1 in ITEM_PIPELINES.
There is also ImagesPipeline for image files.

Supported Storage - File system storage

The files are stored using a SHA1 hash of their URLs for the file names.

File names are the SHA-1 hash of the file's URL.
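To get a feel for what such a name looks like, here is a minimal sketch using only the standard library. The full/ prefix and the way the extension is kept reflect my understanding of the default behavior and may differ between Scrapy versions; default_style_path is a hypothetical name, not a Scrapy function.

```python
import hashlib
import os


def default_style_path(url):
    """Approximate the default naming: full/<SHA-1 of the URL><extension>."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    extension = os.path.splitext(url)[1]
    return f"full/{digest}{extension}"


url = "https://www.jitec.ipa.go.jp/1_04hanni_sukiru/mondai_kaitou_2019h31_2/2019r01a_sg_am_qs.pdf"
print(default_style_path(url))
```

The result is a 40-character hex string plus .pdf, which says nothing about the original file name. That is why the pipeline is customized later in this post.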

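Trying CrawlSpider on the IPA Information Technology Engineers Examination pages

Structure of the target pages

The starting page contains links to the past-exam download page for each fiscal year.

[Image: IPA page]

Each of those pages has links to the past-exam PDFs, grouped by exam category.

[Image: IPA page]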

project

Create a project that crawls the pages under https://www.jitec.ipa.go.jp/1_04hanni_sukiru/_index_mondai.html and downloads the PDFs.
When generating the spider, pass -t crawl to get a CrawlSpider skeleton.

scrapy startproject <project name>
cd <project name>
scrapy genspider -t crawl ipa www.ipa.go.jp

spiders/ipa.py

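The rules extract the past-exam download page for each fiscal year, and each page is parsed to generate one item per PDF.
file_urls can hold multiple URLs, but here each item specifies a single file.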

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from crawldownload.items import CrawldownloadItem

class IpaSpider(CrawlSpider):
    name = 'ipa'
    allowed_domains = ['ipa.go.jp']
    start_urls = ['https://www.jitec.ipa.go.jp/1_04hanni_sukiru/_index_mondai.html']

    rules = (
        Rule(LinkExtractor(allow=r'1_04hanni_sukiru/mondai_kaitou'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Use the spider's own logger; a bare `logger` is not defined in this module
        self.logger.info("%s", response.css('title::text').get())

        for main_area in response.css('#ipar_main'):
            exam_seasons = main_area.css('h3').xpath('string()').getall()

            # Pair each season heading (h3) with the table (div.unit) in the same position
            for exam_season, exam_table in zip(exam_seasons, main_area.css('div.unit')):

                # Generate an item for each PDF file on the page
                for exam_item in exam_table.css('tr'):
                    # Skip header rows, which contain no links
                    if exam_item.css('a').get() is None:
                        continue

                    for exam_link in exam_item.css('a'):
                        exam_pdf = response.urljoin(exam_link.css('a::attr(href)').get())

                        item = CrawldownloadItem()
                        item['season'] = exam_season
                        item['title'] = exam_item.css('td p::text').getall()[1].strip()
                        item['file_title'] = exam_link.css('a::text').get()
                        item['file_urls'] = [ exam_pdf ]
                        yield item
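To check which pages the Rule above would follow, the allow pattern can be tried on its own. As far as I understand, LinkExtractor applies allow as a regular-expression search against the absolute URL, so a plain re.search is a reasonable approximation. The yearly page URL below is a hypothetical example and is_followed is a name made up for this sketch.

```python
import re

# Pattern copied from the Rule in the spider above.
ALLOW_PATTERN = re.compile(r"1_04hanni_sukiru/mondai_kaitou")


def is_followed(url):
    """Return True if the pattern is found anywhere in the absolute URL."""
    return ALLOW_PATTERN.search(url) is not None


index_page = "https://www.jitec.ipa.go.jp/1_04hanni_sukiru/_index_mondai.html"
# Hypothetical URL of a yearly download page, for illustration only.
yearly_page = "https://www.jitec.ipa.go.jp/1_04hanni_sukiru/mondai_kaitou_2019h31.html"

print(is_followed(index_page))   # False
print(is_followed(yearly_page))  # True
```

Note that the PDF URLs also contain mondai_kaitou. I believe LinkExtractor skips them anyway because .pdf is among its default ignored extensions, which is why the PDFs are handed to FilesPipeline through file_urls rather than followed as pages.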

items.py

The file_urls and files fields are the ones FilesPipeline requires.

import scrapy

class CrawldownloadItem(scrapy.Item):
    season = scrapy.Field()
    title = scrapy.Field()
    file_title = scrapy.Field()
    file_urls = scrapy.Field()
    files = scrapy.Field()

pipelines.py

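FilesPipeline uses SHA-1 hash file names by default, so override the file_path() method in a subclass.
Directories that do not exist are created automatically, so it is enough to build and return the path you want to save to.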

from scrapy.pipelines.files import FilesPipeline

import os

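class CrawldownloadPipeline(FilesPipeline):
    # `item` is passed as a keyword argument by recent Scrapy versions
    def file_path(self, request, response=None, info=None, *, item=None):
        # "https://host/dir/file.pdf".split("/") -> ['https:', '', 'host', 'dir', 'file.pdf']
        file_paths = request.url.split("/")
        file_paths.pop(0)  # drop the scheme ('https:')
        file_paths.pop(0)  # drop the empty string between the two slashes
        # Join host and path segments into a path relative to FILES_STORE
        file_name = os.path.join(*file_paths)

        return file_name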

request.url="https://www.jitec.ipa.go.jp/1_04hanni_sukiru/mondai_kaitou_2019h31_2/2019r01a_sg_am_qs.pdf"
↓↓↓
file_name="www.jitec.ipa.go.jp/1_04hanni_sukiru/mondai_kaitou_2019h31_2/2019r01a_sg_am_qs.pdf"
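The same mapping can also be sketched with urllib.parse instead of splitting on "/". This is only an alternative under a few assumptions of mine: query strings and fragments are dropped, forward slashes are used regardless of OS, and url_to_relative_path is a hypothetical helper name.

```python
import posixpath
from urllib.parse import urlsplit


def url_to_relative_path(url):
    """Build '<host>/<path>' from a URL, ignoring scheme, query, and fragment."""
    parts = urlsplit(url)
    return posixpath.join(parts.netloc, parts.path.lstrip("/"))


url = "https://www.jitec.ipa.go.jp/1_04hanni_sukiru/mondai_kaitou_2019h31_2/2019r01a_sg_am_qs.pdf"
print(url_to_relative_path(url))
```

Inside file_path() this would be called with request.url. Dropping the query string keeps characters such as ? out of file names, at the cost of mapping URLs that differ only in their query to the same file.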

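settings.py

Enable FilesPipeline.

Specify the download directory with FILES_STORE
Enable the FilesPipeline subclass in ITEM_PIPELINES

The default settings allow too much concurrency, so adjust them.

Concurrent requests: 1
Download interval: 3 seconds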

# Obey robots.txt rules
#ROBOTSTXT_OBEY = True
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS = 1

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
DOWNLOAD_DELAY = 3

…(omitted)…

FILES_STORE = 'download'

ITEM_PIPELINES = {
    #'scrapy.pipelines.files.FilesPipeline': 1,
    'crawldownload.pipelines.CrawldownloadPipeline': 1,
}


nullpo
