[Python] Trying Out Scrapy (1)

Scrapy

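Scrapy is a framework for implementing crawlers in Python.

The whole flow of visiting web pages, extracting information from them, and then processing and saving that information can be handled entirely within Scrapy.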

Scrapy official site

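To start, we will try Scrapy by following the official tutorial. The tutorial is organized as follows.

Creating a Scrapy project
Writing a spider that crawls pages and extracts information
Saving the data to an external file
Changing the spider to follow links
Passing command-line arguments to the spider

The tutorial crawls the site quotes.toscrape.com.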

Installation

Install Scrapy with pip.

Installation guide

$ pip install scrapy

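Creating a Scrapy project

First, create a Scrapy project.

When a project is created, Scrapy automatically generates the basic files needed for crawling. You control the crawl by editing the files generated inside the project.

Run the following command in the directory where you want to create the project. The form of the command is scrapy startproject <project name>.

$ scrapy startproject tutorial

Let's look at the generated files with the tree command.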

$ tree ./tutorial /F
Folder PATH listing for volume Windows
Volume serial number is D8D0-XXXX
C:\USERS\USER1\DOCUMENTS\TMP\TUTORIAL
│  scrapy.cfg
│
└─tutorial
    │  items.py
    │  middlewares.py
    │  pipelines.py
    │  settings.py
    │  __init__.py
    │
    ├─spiders
    │  │  __init__.py
    │  │
    │  └─__pycache__
    └─__pycache__

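Creating a spider

Next, create a spider inside the project.

The spider is what actually visits the pages and extracts the information.

In the tutorial/spiders directory, create a file named quotes_spider.py with the following contents.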

import scrapy

# Inherit from scrapy.Spider.
class QuotesSpider(scrapy.Spider):
    # name identifies the spider within the project.
    name = "quotes"

    # Specify the pages to crawl.
    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    # Describe what to do with each crawled page.
    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

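This code can also be shortened as follows. When start_urls is defined, Scrapy generates the initial requests from it and uses parse as the default callback.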

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
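
The file name in parse comes from the second-to-last element of the URL split on "/". The short sketch below shows why index -2 gives the page number. The helper name page_filename is hypothetical and only used here for illustration.

```python
def page_filename(url):
    """Build the file name the same way parse() does (hypothetical helper)."""
    # "http://quotes.toscrape.com/page/1/".split("/") ends with
    # [..., "page", "1", ""], so index -2 is the page number.
    # The empty last element comes from the trailing slash.
    page = url.split("/")[-2]
    return "quotes-%s.html" % page


print(page_filename("http://quotes.toscrape.com/page/1/"))  # quotes-1.html
print(page_filename("http://quotes.toscrape.com/page/2/"))  # quotes-2.html
```

Note that this relies on the trailing slash: without it, index -2 would be "page" rather than the number.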

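Running the spider

Move to the top-level directory of the project (tutorial in this case) and run the command below. The form of the command is scrapy crawl <spider name>.

$ scrapy crawl quotes

A lot of output is printed to the screen, and in the end the pages the spider visited are saved in the tutorial folder. At this stage nothing is extracted, so the crawled html files are saved as they are.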

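Extracting information / scrapy shell

Next, extract the information we want from the crawled pages.

The part to extract is specified with a CSS selector or an XPath expression, and the scrapy shell tool lets you try out how to write it.

Run the following command.

$ scrapy shell "http://quotes.toscrape.com/page/1/"

The shell loads the page and then waits for input, as shown below.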

[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    
[s]   item       {}
[s]   request    
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   
[s]   spider     
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>>

CSS selectors

From the page loaded in scrapy shell above, select the title tag with a CSS selector.

>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

From the object returned above, the following methods extract the data.

# Get the tag together with its text, as a list.
>>> response.css('title').getall()
['<title>Quotes to Scrape</title>']
# Get only the text, as a list.
>>> response.css('title::text').getall()
['Quotes to Scrape']
# Get only the first text match.
>>> response.css('title::text').get()
'Quotes to Scrape'

Regular expressions can also be used.

>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']
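
The three results above can be approximated with the standard re module. The sketch below is only a rough stand-in for the .re() method, assuming it returns every match and flattens capture groups into one list. The function name re_like_selector is hypothetical and not part of Scrapy.

```python
import re


def re_like_selector(pattern, text):
    """Rough stand-in for .re(): all matches, capture groups flattened."""
    results = []
    for match in re.finditer(pattern, text):
        groups = match.groups()
        if groups:
            # With capture groups, keep each group as its own list entry.
            results.extend(groups)
        else:
            # Without groups, keep the whole match.
            results.append(match.group(0))
    return results


title = "Quotes to Scrape"
print(re_like_selector(r"Quotes.*", title))        # ['Quotes to Scrape']
print(re_like_selector(r"Q\w+", title))            # ['Quotes']
print(re_like_selector(r"(\w+) to (\w+)", title))  # ['Quotes', 'Scrape']
```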

XPath

Select the title tag with an XPath expression.

>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]

The result is the same element as with the CSS selector.
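
To get a feel for path expressions without starting scrapy shell, the standard library's xml.etree.ElementTree can be used as a rough analogy. This is not what Scrapy uses internally, it supports only a subset of XPath, and the SAMPLE string is a hypothetical, well-formed stand-in for the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the page. Real HTML is often not valid XML,
# so this only works for a small, well-formed fragment.
SAMPLE = "<html><head><title>Quotes to Scrape</title></head><body></body></html>"

root = ET.fromstring(SAMPLE)

# ElementTree needs a relative path: './/title' plays the role of '//title'.
titles = [element.text for element in root.findall(".//title")]
print(titles)  # ['Quotes to Scrape']
```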

For more on selecting elements, see the following page of the manual.

Selectors

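Extracting multiple pieces of information

Let's extract the quote, author, and tags from the tutorial's target site.

Looking at the page source, each quote is stored as a block like the following.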

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
        <span>by <small class="author" itemprop="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world" />
            
            <a class="tag" href="/tag/change/page/1/">change</a>
            
            <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
            
            <a class="tag" href="/tag/thinking/page/1/">thinking</a>
            
            <a class="tag" href="/tag/world/page/1/">world</a>
            
        </div>
    </div>

Start scrapy shell.

$ scrapy shell "http://quotes.toscrape.com"

Select the parts enclosed in div tags with class="quote".

>>> response.css("div.quote")
[<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>,
# (omitted)
 <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>]

Narrow it down to the first quote (Einstein) and extract the information we want.

>>> quote = response.css("div.quote")[0]
>>> title = quote.css("span.text::text").get()
>>> title
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> author = quote.css("small.author::text").get()
>>> author
'Albert Einstein'
>>> tags = quote.css("div.tags a.tag::text").getall()
>>> tags
['change', 'deep-thoughts', 'thinking', 'world']

That worked, so loop over all the elements and check each one.

>>> for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").get()
...     author = quote.css("small.author::text").get()
...     tags = quote.css("div.tags a.tag::text").getall()
...     print(dict(text=text, author=author, tags=tags))
...
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']}
{'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'author': 'Albert Einstein', 'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles']}
{'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'author': 'Jane Austen', 'tags': ['aliteracy', 'books', 'classic', 'humor']}
{'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'author': 'Marilyn Monroe', 'tags': ['be-yourself', 'inspirational']}
{'text': '“Try not to become a man of success. Rather become a man of value.”', 'author': 'Albert Einstein', 'tags': ['adulthood', 'success', 'value']}
{'text': '“It is better to be hated for what you are than to be loved for what you are not.”', 'author': 'André Gide', 'tags': ['life', 'love']}
{'text': "“I have not failed. I've just found 10,000 ways that won't work.”", 'author': 'Thomas A. Edison', 'tags': ['edison', 'failure', 'inspirational', 'paraphrased']}
{'text': "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", 'author': 'Eleanor Roosevelt', 'tags': ['misattributed-eleanor-roosevelt']}
{'text': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'tags': ['humor', 'obvious', 'simile']}

Extracting information with the spider and saving it to an external file

Put the code so far into the spider and run it.

import scrapy

# Inherit from scrapy.Spider.
class QuotesSpider(scrapy.Spider):
    # name identifies the spider within the project.
    name = "quotes"

    # Specify the pages to crawl.
    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    # Describe what to extract from each crawled page.
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

$ scrapy crawl quotes

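The output on the screen confirms that the information is being extracted.

The extracted items can be saved to a file by adding the -o option.

The command below writes them to a json file.

$ scrapy crawl quotes -o quotes.json

With -o, Scrapy appends to the file instead of overwriting it, so running the command twice against the same json file leaves a file that is no longer valid JSON. For this reason the official tutorial recommends the JSON Lines format, where each line is an independent record.

$ scrapy crawl quotes -o quotes.jl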
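
The difference between the two formats under appending can be checked with the standard json module alone. The sketch below does not call Scrapy. The two item lists are made-up sample data standing in for two crawl runs, and the strings only imitate what appended files would contain.

```python
import json

# Made-up sample items standing in for the results of two separate runs.
items_run1 = [{"author": "Albert Einstein"}]
items_run2 = [{"author": "Jane Austen"}]

# JSON: appending a second array after the first gives two top-level values,
# which a JSON parser rejects.
json_text = json.dumps(items_run1) + json.dumps(items_run2)
try:
    json.loads(json_text)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

# JSON Lines: one object per line, so appended lines stay parseable.
jl_text = ""
for run in (items_run1, items_run2):
    for item in run:
        jl_text += json.dumps(item) + "\n"
records = [json.loads(line) for line in jl_text.splitlines()]

print(json_ok)       # False
print(len(records))  # 2
```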

Following links automatically

Now have the spider follow links to crawl further pages.

Looking at the page source, the link to the next page is written as follows.

<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">→</span></a>
    </li>
</ul>

Start scrapy shell and try to get the link to the next page.

>>> response.css('li.next a').get()
'<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'
# Getting the link, method 1
>>> response.css('li.next a::attr(href)').get()
'/page/2/'
# Getting the link, method 2
>>> response.css('li.next a').attrib['href']
'/page/2/'

The link is extracted correctly, so add this to the spider code.

import scrapy

# Inherit from scrapy.Spider.
class QuotesSpider(scrapy.Spider):
    # name identifies the spider within the project.
    name = "quotes"

    # Specify the first page to crawl.
    def start_requests(self):
        url = 'http://quotes.toscrape.com/page/1/'
        yield scrapy.Request(url=url, callback=self.parse)

    # Describe what to extract from each crawled page.
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        # Get the link to the next page; if there is one, request it and run parse on it.
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

This time, write the output to a csv file.

$ scrapy crawl quotes -o quotes.csv
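
In the spider above, response.urljoin turns the relative href '/page/2/' into an absolute URL before the next request is made. As far as the Scrapy documentation describes it, this is a wrapper around the standard urllib.parse.urljoin with the response URL as the base, so the behaviour can be sketched without Scrapy as follows. The variable names are only for illustration.

```python
from urllib.parse import urljoin

# URL of the page currently being parsed (the base for relative links).
current_url = "http://quotes.toscrape.com/page/1/"

# A root-relative href, as found in the "next" link.
next_url = urljoin(current_url, "/page/2/")
print(next_url)  # http://quotes.toscrape.com/page/2/

# An href that is already absolute is returned unchanged.
absolute = urljoin(current_url, "http://quotes.toscrape.com/tag/life/")
print(absolute)  # http://quotes.toscrape.com/tag/life/
```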

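Passing command-line arguments to the spider

The -a option on the command line passes an argument to the spider.

Here we pass life as tag so that only http://quotes.toscrape.com/tag/life is crawled. Inside the spider the value is available as self.tag, but the built-in function getattr is used so that the spider still works when the argument is not given.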

import scrapy

# Inherit from scrapy.Spider.
class QuotesSpider(scrapy.Spider):
    # name identifies the spider within the project.
    name = "quotes"

    # Specify the first page to crawl.
    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url=url, callback=self.parse)

    # Describe what to extract from each crawled page.
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        # Get the link to the next page; if there is one, request it and run parse on it.
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

Run the following command.

$ scrapy crawl quotes -o quotes.csv -a tag=life

Checking the output csv file confirms that the tags column of every row contains life.
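
The getattr fallback used in start_requests can be tried outside Scrapy. The sketch below uses a hypothetical FakeSpider class, which is not part of Scrapy, and assumes that each -a name=value pair ends up as an instance attribute on the spider.

```python
class FakeSpider:
    """Hypothetical stand-in for a spider; not part of Scrapy."""

    def __init__(self, **kwargs):
        # Assumption: mimic -a name=value arguments becoming attributes.
        self.__dict__.update(kwargs)

    def start_url(self):
        url = "http://quotes.toscrape.com/"
        # getattr with a default avoids AttributeError when -a tag=... is omitted.
        tag = getattr(self, "tag", None)
        if tag is not None:
            url = url + "tag/" + tag
        return url


print(FakeSpider().start_url())            # http://quotes.toscrape.com/
print(FakeSpider(tag="life").start_url())  # http://quotes.toscrape.com/tag/life
```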