[Python] Trying Out Scrapy (2)

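This post continues from the previous one.

This time, building on the previous tutorial, we generate a spider from the command line and save the scraped data to MySQL.

The following articles were used as references.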

Python製クローラー「Scrapy」の始め方メモ

PythonのScrapyでHTML、XML、CSV用のクローラーを作ってみる

10分で理解する Scrapy

Generating the project and the spider

First, generate the project.

$ scrapy startproject tutorial2

Next, generate the spider.

# Move into the project folder
$ cd tutorial2
# List the available templates
$ scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed
# scrapy genspider [-t template] ‹name› ‹domain›
$ scrapy genspider -t crawl quotes2 quotes.toscrape.com

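The documentation for genspider is here.

The templates live in \Lib\site-packages\scrapy\templates\spiders under the Python installation where Scrapy is installed. They are also available on GitHub here.

Since this bot crawls across pages, the crawl template is used.

The generated folders and files look like this.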

C:\USERS\USER\TUTORIAL2
│  scrapy.cfg
│
└─tutorial2
    │  middlewares.py
    │  pipelines.py
    │  settings.py
    │  __init__.py
    │
    ├─spiders
    │  │  quotes2.py
    │  │  __init__.py
    │  │
    │  └─__pycache__
    │          __init__.cpython-37.pyc
    │
    └─__pycache__
            settings.cpython-37.pyc
            __init__.cpython-37.pyc

Configuring settings.py

Configure settings.py to reduce the load on the server.

Download interval

Uncomment DOWNLOAD_DELAY to leave a gap between downloads.

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
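
As a rough illustration of what DOWNLOAD_DELAY = 3 means in practice: according to the Scrapy documentation, when RANDOMIZE_DOWNLOAD_DELAY is left at its default (True), the actual wait is drawn from between 0.5 and 1.5 times DOWNLOAD_DELAY. The sketch below only mimics that range with the standard library; randomized_delay is a hypothetical helper, not a Scrapy function.

```python
import random

DOWNLOAD_DELAY = 3  # seconds, the same value as in settings.py above


def randomized_delay(base_delay, rng=random):
    """Hypothetical helper: mimic the 0.5x-1.5x jitter described in the Scrapy docs."""
    return rng.uniform(0.5 * base_delay, 1.5 * base_delay)


low, high = 0.5 * DOWNLOAD_DELAY, 1.5 * DOWNLOAD_DELAY
samples = [randomized_delay(DOWNLOAD_DELAY) for _ in range(5)]

print(f"expected range: {low} to {high} seconds")
print([round(s, 2) for s in samples])
```

So with this setting, consecutive requests to the same site should be spaced roughly 1.5 to 4.5 seconds apart.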

Enabling the cache

Enable the HTTP cache to avoid downloading the same content repeatedly.

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 3600
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

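Configuring items.py

Next, configure items.py.

As we saw last time, scraping works without any particular setup, but defining fields in items.py lets Scrapy handle the scraped data in a structured way.

We collect the same information as last time, plus the URL each item was scraped from.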

import scrapy


class Tutorial2Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    text = scrapy.Field()
    author = scrapy.Field()
    tag = scrapy.Field()
    url = scrapy.Field()

Configuring the spider

This time a skeleton has already been generated, as shown below, so we modify it.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class Quotes2Spider(CrawlSpider):
    name = 'quotes2'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        #item['name'] = response.xpath('//div[@id="name"]').get()
        #item['description'] = response.xpath('//div[@id="description"]').get()
        return item

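Compared with last time, rules and def parse_item(self, response) are new.

Crawling rules

With these settings, the LinkExtractor inside rules makes the spider automatically look for links in each page.

By writing rules in this way, you control the spider's behavior: whether to collect information from the current page, and whether to follow the links it finds.

Crawling rules

Looking at the official example below gives a rough idea of how rules are used.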

rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )
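
To get a feel for how allow and deny interact, here is a small sketch that uses only the re module. It assumes that LinkExtractor applies each pattern with search semantics against the absolute URL; is_extracted is a hypothetical stand-in for that filtering step, and the example.com URLs are made up.

```python
import re

# Same patterns as in the official example above
allow = [re.compile(r'category\.php')]
deny = [re.compile(r'subsection\.php')]


def is_extracted(url):
    """Hypothetical stand-in for LinkExtractor's allow/deny filtering."""
    allowed = any(p.search(url) for p in allow)
    denied = any(p.search(url) for p in deny)
    return allowed and not denied


urls = [
    'http://www.example.com/category.php?id=1',
    'http://www.example.com/category.php/subsection.php',
    'http://www.example.com/item.php?id=2',
]
results = {url: is_extracted(url) for url in urls}

for url, extracted in results.items():
    print(extracted, url)
```

Only the first URL passes: the second matches deny, and the third does not match allow at all.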

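def parse_item(self, response)

Last time the parsed content was returned directly with yield. This time, since items.py has been set up, the callback returns the parsed content using the same field names. Note that the code below fills a plain dict with those keys rather than instantiating Tutorial2Item.

quotes2.py

This spider collects the same content as last time, plus the url.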

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class Quotes2Spider(CrawlSpider):
    name = 'quotes2'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    rules = (
        Rule(LinkExtractor(allow=r'quotes.toscrape.com/page/\d*/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        for quote in response.css('div.quote'):
            item = {}
            item['text'] = quote.css('span.text::text').get()
            item['author'] = quote.css('small.author::text').get()
            item['tag'] = quote.css('div.tags a.tag::text').getall()
            item['url'] = response.url
            yield item

With return item the function would exit after the first quote, so it is changed to yield.

Running the spider and saving the results to CSV suggests that the data is being collected.
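
The difference between return and yield inside the loop can be checked without Scrapy. In this sketch the quotes list and both function names are hypothetical stand-ins for the elements selected by response.css('div.quote') and for parse_item.

```python
quotes = ['quote A', 'quote B', 'quote C']  # hypothetical stand-ins for div.quote elements


def parse_with_return(quotes):
    for q in quotes:
        # The function exits here on the first iteration
        return {'text': q}


def parse_with_yield(quotes):
    for q in quotes:
        # The generator hands back one item per iteration and keeps going
        yield {'text': q}


first_only = parse_with_return(quotes)
all_items = list(parse_with_yield(quotes))

print(first_only)
print(all_items)
```

The return version produces a single item, while the yield version produces one item per quote.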

pipelines.py

In pipelines.py you describe the processing to apply to the items collected by the spider; Scrapy runs that processing each time the spider yields an item.

Item Pipeline

settings.py

To enable the pipeline, uncomment the following part of settings.py.

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'tutorial2.pipelines.Tutorial2Pipeline': 300,
}

Skeleton file

The skeleton generated by the command looks like this.

class Tutorial2Pipeline(object):
    def process_item(self, item, spider):
        return item

Modifying and dropping items

First, convert author to all uppercase.

class Tutorial2Pipeline(object):
    def process_item(self, item, spider):
        item['author'] = item['author'].upper()
        return item

Next, drop the results from http://quotes.toscrape.com/page/2/.

class Tutorial2Pipeline(object):
    def process_item(self, item, spider):
        item['author'] = item['author'].upper()
        if item['url'] == 'http://quotes.toscrape.com/page/2/':
            return
        return item
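
For comparison, the way the Scrapy documentation describes discarding an item is to raise scrapy.exceptions.DropItem instead of returning nothing. The sketch below shows that shape under some assumptions: it defines a local stand-in DropItem class so that it runs without Scrapy installed (in a real project you would import the class from scrapy.exceptions), and SKIPPED_URL is a name introduced here for illustration.

```python
class DropItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem so this sketch runs without Scrapy."""


SKIPPED_URL = 'http://quotes.toscrape.com/page/2/'  # illustrative constant


class Tutorial2Pipeline:
    def process_item(self, item, spider):
        # Discard items from the page we do not want to keep
        if item['url'] == SKIPPED_URL:
            raise DropItem(f"skipping item from {item['url']}")
        item['author'] = item['author'].upper()
        return item


pipeline = Tutorial2Pipeline()
kept = pipeline.process_item(
    {'author': 'Albert Einstein', 'url': 'http://quotes.toscrape.com/page/1/'},
    spider=None,
)

try:
    pipeline.process_item({'author': 'Jane Austen', 'url': SKIPPED_URL}, spider=None)
    dropped = False
except DropItem:
    dropped = True

print(kept['author'], dropped)
```

Raising the exception makes the intent explicit and avoids passing None to later pipeline stages.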

Run the crawl, saving the output to CSV.

$ scrapy crawl quotes2 -o quote.csv

It appears to work for now.

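Saving to MySQL

SQLite would be easier, but here I write the code that saves items to MySQL using SQLAlchemy.

The following was used as a reference.

Scrapy Tutorial #9: How To Use Scrapy Item

Creating the database

Create the database in MySQL, and set up a dedicated user for the connection.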

$ mysql -h localhost -u root -p
# (output omitted)
mysql> create database quote_scrapy character set utf8 collate utf8_general_ci;
Query OK, 1 row affected, 2 warnings (0.11 sec)

mysql> show create database quote_scrapy;
+--------------+----------------------------------------------------------------------------------------------------------+
| Database     | Create Database                                                                                          |
+--------------+----------------------------------------------------------------------------------------------------------+
| quote_scrapy | CREATE DATABASE `quote_scrapy` /*!40100 DEFAULT CHARACTER SET utf8 */ /*!80016 DEFAULT ENCRYPTION='N' */ |
+--------------+----------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> create user quote_scrapy@localhost identified by 'quote_scrapy';
Query OK, 0 rows affected (0.05 sec)

mysql> grant all privileges on quote_scrapy.* to quote_scrapy@localhost;
Query OK, 0 rows affected (0.08 sec)

mysql> exit
Bye

sqlalchemy

Use SQLAlchemy to define the database model.

Save it as models.py in the same folder as pipelines.py.

from sqlalchemy import create_engine, Column, Table, ForeignKey
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import (
    Integer, SmallInteger, String, Date, DateTime, Float, Boolean, Text, LargeBinary)


# Connection URL for the quote_scrapy database created above
CONNECTION_STRING = '{drivername}://{user}:{password}@{host}:{port}/{db_name}?charset=utf8'.format(
    drivername='mysql+pymysql',
    user='quote_scrapy',
    password='quote_scrapy',
    host='localhost',
    port='3306',
    db_name='quote_scrapy'
)

DeclarativeBase = declarative_base()

def db_connect():
    # echo=True logs every SQL statement, which helps while debugging
    return create_engine(CONNECTION_STRING, echo=True)

def create_table(engine):
    DeclarativeBase.metadata.create_all(engine)

class QuoteDatabase(DeclarativeBase):
    __tablename__ = 'quote_table'

    id = Column(Integer, primary_key=True)
    text = Column('text', Text())
    author = Column('author', String(255))
    tag = Column('tag', String(255))
    url = Column('url', String(255))
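
To make the connection string easier to inspect, the sketch below builds the same URL with a small function and prints it. build_connection_string is a hypothetical helper, not part of SQLAlchemy; the call to urllib.parse.quote_plus is an addition that percent-encodes the password, which matters only if it contains characters such as '@' or '/'.

```python
from urllib.parse import quote_plus


def build_connection_string(user, password, host, port, db_name,
                            drivername='mysql+pymysql'):
    """Hypothetical helper: format a database URL, percent-encoding the password."""
    return '{drivername}://{user}:{password}@{host}:{port}/{db_name}?charset=utf8'.format(
        drivername=drivername,
        user=user,
        password=quote_plus(password),
        host=host,
        port=port,
        db_name=db_name,
    )


url = build_connection_string('quote_scrapy', 'quote_scrapy', 'localhost', 3306, 'quote_scrapy')
tricky = build_connection_string('quote_scrapy', 'p@ss', 'localhost', 3306, 'quote_scrapy')

print(url)
print(tricky)
```

With the credentials used in this post, the first URL is identical to the one in models.py; the second shows how a made-up password containing '@' would be encoded.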

Test whether the code above works.

Create the following as test_models.py in the same folder as models.py and run it.

from sqlalchemy.orm import sessionmaker
import models


engine = models.db_connect()
models.create_table(engine)
Session = sessionmaker(bind=engine)

session = Session()
quotedb = models.QuoteDatabase()
quotedb.text = "The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."
quotedb.author = "ALBERT EINSTEIN"
quotedb.tag = "change,deep-thoughts,thinking,world"
quotedb.url = "http://quotes.toscrape.com/page/1/"

try:
    session.add(quotedb)
    session.commit()
    # Check that the row was inserted.
    obj = session.query(models.QuoteDatabase).first()
    print(obj.text)
except:
    session.rollback()
    raise
finally:
    session.close()

The row was inserted, so the SQLAlchemy model appears to work.

Next, write the code that saves data to MySQL in pipelines.py.

from sqlalchemy.orm import sessionmaker
from tutorial2.models import QuoteDatabase, db_connect, create_table

class Tutorial2Pipeline(object):
    def __init__(self):
        engine = db_connect()
        create_table(engine)
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
        # Preprocessing
        item['author'] = item['author'].upper()
        if item['url'] == 'http://quotes.toscrape.com/page/2/':
            return

        # Save to the database
        session = self.Session()
        quotedb = QuoteDatabase()
        quotedb.text = item['text']
        quotedb.author = item['author']
        quotedb.tag = item['tag']
        quotedb.url = item['url']

        try:
            session.add(quotedb)
            session.commit()
        except:
            session.rollback()
            raise
        finally:
            session.close()

        return item

Run the crawl.

$ scrapy crawl quotes2

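Checking the log, some items raise an error and others do not.

Debugging: converting the list to a string

The error message is long, but it boils down to the following.

sqlalchemy.exc.InternalError: (pymysql.err.InternalError) (1241, 'Operand should contain 1 column(s)')

Feeding the failing input to test_models.py shows that the error comes from trying to insert a list into the database, so the list is converted to a string with join.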

from sqlalchemy.orm import sessionmaker
import models


engine = models.db_connect()
models.create_table(engine)
Session = sessionmaker(bind=engine)

session = Session()
quotedb = models.QuoteDatabase()
data = {'text': '“... a mind needs books as a sword needs a whetstone, if it is to keep its edge.”', 'author': 'GEORGE R.R. MARTIN', 'tag': ['books', 'mind'], 'url': 'http://quotes.toscrape.com/page/10/'}
quotedb.text = data['text']
quotedb.author = data['author']
quotedb.tag = ';'.join(data['tag'])
quotedb.url = data['url']

try:
    session.add(quotedb)
    session.commit()

    #query again
    obj = session.query(models.QuoteDatabase).first()
    print(obj.text)
except:
    session.rollback()
    raise
finally:
    session.close()

This no longer raises the error, so pipelines.py is rewritten as follows.

from sqlalchemy.orm import sessionmaker
from tutorial2.models import QuoteDatabase, db_connect, create_table

class Tutorial2Pipeline(object):
    def __init__(self):
        engine = db_connect()
        create_table(engine)
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
        # Preprocessing
        item['author'] = item['author'].upper()
        # Convert the list to a string
        item['tag'] = ';'.join(item['tag'])
        if item['url'] == 'http://quotes.toscrape.com/page/2/':
            return

        # Save to the database
        session = self.Session()
        quotedb = QuoteDatabase()
        quotedb.text = item['text']
        quotedb.author = item['author']
        quotedb.tag = item['tag']
        quotedb.url = item['url']

        try:
            session.add(quotedb)
            session.commit()
        except:
            session.rollback()
            raise
        finally:
            session.close()

        return item
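
The list-to-string conversion used in the pipeline is easy to try on its own. This sketch joins the tags with ';' as above and also shows how the stored string could be split back into a list; the restore step is an addition for illustration and assumes no tag itself contains ';'.

```python
tags = ['books', 'mind']  # the tag list from the failing item

# What the pipeline stores in the tag column
stored = ';'.join(tags)

# Hypothetical reverse step when reading the column back;
# an empty string is mapped to an empty list instead of ['']
restored = stored.split(';') if stored else []

print(stored)
print(restored)
```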

Run the crawl again.

$ scrapy crawl quotes2

No errors this time.

Checking in MySQL, the same number of rows as in the CSV appears to have been saved.

$ mysql -h localhost -u root -p
# (output omitted)

mysql> use quote_scrapy;
Database changed

mysql> select * from quote_table limit 1;
+----+-----------------------------------------------------------------------------------------------------------------------+-----------------+-------------------------------------+------------------------------------+
| id | text                                                                                                                  | author          | tag                                 | url                                |
+----+-----------------------------------------------------------------------------------------------------------------------+-----------------+-------------------------------------+------------------------------------+
|  1 | “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” | ALBERT EINSTEIN | change;deep-thoughts;thinking;world | http://quotes.toscrape.com/page/1/ |
+----+-----------------------------------------------------------------------------------------------------------------------+-----------------+-------------------------------------+------------------------------------+
1 row in set (0.00 sec)

mysql> select count(*) from quote_table;
+----------+
| count(*) |
+----------+
|       90 |
+----------+
1 row in set (0.07 sec)
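
As mentioned earlier, SQLite would be the simpler option. For readers who only want to try the storage step, here is a minimal sketch using the standard library's sqlite3 module with an in-memory database. It is not the SQLAlchemy/MySQL setup used in this post; the table layout mirrors quote_table, and the two sample items are hand-written stand-ins for scraped data.

```python
import sqlite3

# Hand-written sample items in the same shape the pipeline receives
items = [
    {'text': 'sample quote 1', 'author': 'ALBERT EINSTEIN',
     'tag': ['change', 'deep-thoughts', 'thinking', 'world'],
     'url': 'http://quotes.toscrape.com/page/1/'},
    {'text': 'sample quote 2', 'author': 'GEORGE R.R. MARTIN',
     'tag': ['books', 'mind'],
     'url': 'http://quotes.toscrape.com/page/10/'},
]

conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE quote_table ('
    'id INTEGER PRIMARY KEY, text TEXT, author TEXT, tag TEXT, url TEXT)'
)

# "with conn" commits on success and rolls back if an exception is raised
with conn:
    conn.executemany(
        'INSERT INTO quote_table (text, author, tag, url) VALUES (?, ?, ?, ?)',
        [(i['text'], i['author'], ';'.join(i['tag']), i['url']) for i in items],
    )

count = conn.execute('SELECT count(*) FROM quote_table').fetchone()[0]
first = conn.execute(
    'SELECT author, tag FROM quote_table ORDER BY id LIMIT 1'
).fetchone()
conn.close()

print(count, first)
```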
