Getting Started with the Scrapy Crawler Framework

Archived University Note

This content is from my university archives and may not be reliable or up-to-date.

Steps for Developing a Scrapy Spider

  1. Subclass scrapy.Spider
  2. Give the spider a name
  3. Set the starting crawl point(s)
  4. Implement the page parsing function

A First Example

Scrape book titles and prices from books.toscrape.com:

import scrapy

class BooksSpider(scrapy.Spider):
    # Identifier for the spider
    name = "books"

    # Starting point
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        for book in response.css('article.product_pod'):
            name = book.xpath('./h3/a/@title').extract_first()
            price = book.css('p.price_color::text').extract_first()
            yield {
                'name': name,
                'price': price,
            }

        # Follow the next page, if any
        next_url = response.css('ul.pager li.next a::attr(href)').extract_first()
        if next_url:
            next_url = response.urljoin(next_url)
            yield scrapy.Request(next_url, callback=self.parse)

Run it with:

scrapy crawl books -o books.csv

Extracting Data with Selector

Creating a Selector Object

>>> from scrapy.selector import Selector
>>> text = '''
<html>
	<body>
		<h1>hello world</h1>
		<h1>hello scrapy</h1>
		<b>hello python</b>
		<ul>
			<li>c++</li>
			<li>java</li>
			<li>python</li>
		</ul>
	</body>
</html>
'''

Selecting Data

>>> selector = Selector(text=text)
>>> selector
<Selector xpath=None data='<html>\n\t<body>\n\t\t<h1>hello world</h1>\n\t\t'>

>>> selector_list = selector.xpath('//h1')
>>> selector_list
[<Selector xpath='//h1' data='<h1>hello world</h1>'>, <Selector xpath='//h1' data='<h1>hello scrapy</h1>'>]

Extracting Data

Use the extract() and extract_first() methods to extract text content from a Selector object.