Getting Started with the Scrapy Crawler Framework

Archived University Note

This content is from my university archives and may not be reliable or up-to-date.

Steps for Developing a Scrapy Spider

  1. Subclass scrapy.Spider
  2. Give the spider a name
  3. Set the starting crawl point(s)
  4. Implement the page parsing function

A First Example

Scrape book titles and prices from books.toscrape.com:

import scrapy

class BooksSpider(scrapy.Spider):
    # Identifier for the spider
    name = "books"

    # Starting point
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        for book in response.css('article.product_pod'):
            name = book.xpath('./h3/a/@title').extract_first()
            price = book.css('p.price_color::text').extract_first()
            yield {
                'name': name,
                'price': price,
            }

        # Follow the next page, if any
        next_url = response.css('ul.pager li.next a::attr(href)').extract_first()
        if next_url:
            next_url = response.urljoin(next_url)
            yield scrapy.Request(next_url, callback=self.parse)

Run it with:

scrapy crawl books -o books.csv

Extracting Data with Selector

Creating a Selector Object

>>> from scrapy.selector import Selector
>>> text = '''
<html>
	<body>
		<h1>hello world</h1>
		<h1>hello scrapy</h1>
		<b>hello python</b>
		<ul>
			<li>c++</li>
			<li>java</li>
			<li>python</li>
		</ul>
	</body>
</html>
'''

Selecting Data

>>> selector = Selector(text=text)
>>> selector
<Selector xpath=None data='<html>\n\t<body>\n\t\t<h1>hello world</h1>\n\t\t'>

>>> selector_list = selector.xpath('//h1')
>>> selector_list
[<Selector xpath='//h1' data='<h1>hello world</h1>'>, <Selector xpath='//h1' data='<h1>hello scrapy</h1>'>]

Extracting Data

Use the extract() and extract_first() methods to extract text content from a Selector object.