
The Complete Guide to Web Scraping with Selenium on Heroku

Selenium is a powerful browser-automation tool, and combined with the Heroku platform it can drive robust web-scraping and automated-testing solutions. This guide walks through configuring and using Selenium on Heroku.


Introduction to Selenium

What Is Selenium?

Selenium is an automation tool for testing web applications:

  • Multi-browser support: Chrome, Firefox, Safari, Edge
  • Multi-language bindings: Python, Java, JavaScript, C#, Ruby
  • Rich feature set: element location, form filling, screenshots, page navigation
  • Use cases: automated testing, data scraping, form submission, page monitoring

Core Components

  Component        Function
  WebDriver        Browser driver; the core API
  Selenium Grid    Distributed test execution
  IDE              Record-and-playback tool
  RC               Remote Control (deprecated)

The Heroku Platform

Advantages and Challenges

Advantages

  • Easy deployment: push with Git and you are live
  • Auto-scaling: resources adjust to load automatically
  • Cost-effective: pay for what you use
  • Multi-language support: Python, Node.js, and more

Challenges

  • Ephemeral filesystem: files do not persist across restarts
  • Process restarts: dynos are cycled roughly every 24 hours
  • Network restrictions: certain ports and protocols are limited
  • No root access: system-level software cannot be installed directly

Prerequisites

Accounts and Tools

  1. Heroku account: sign up for free at heroku.com
  2. Heroku CLI: for deploying and managing apps
  3. Git: version control
  4. Python 3.7+: language runtime

Install the Heroku CLI

# macOS
brew tap heroku/brew && brew install heroku

# Ubuntu/Debian
curl https://cli-assets.heroku.com/install.sh | sh

# Windows (using Chocolatey)
choco install heroku-cli

Verify the installation:

heroku --version
heroku login

Environment Setup

Step 1: Create the Flask App Structure

selenium-heroku/
├── app.py
├── requirements.txt
├── runtime.txt
├── Procfile
├── .gitignore
└── selenium_utils.py

Step 2: Configure requirements.txt

Flask==2.3.3
gunicorn==21.2.0
selenium==4.15.2
webdriver-manager==4.0.1
requests==2.31.0
beautifulsoup4==4.12.2
fake-useragent==1.4.0
undetected-chromedriver==3.5.4

Step 3: Set Up Buildpacks

Add the following buildpacks in your Heroku app settings, in this order:

  1. Python (official)
  2. Google Chrome: https://github.com/heroku/heroku-buildpack-google-chrome
  3. ChromeDriver: https://github.com/heroku/heroku-buildpack-chromedriver
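The same setup can be scripted with the Heroku CLI instead of the dashboard; a sketch, assuming your app is named `selenium-heroku` (substitute your own app name):

```shell
# Order matters: Python first, then Chrome, then ChromeDriver
heroku buildpacks:add --index 1 heroku/python --app selenium-heroku
heroku buildpacks:add https://github.com/heroku/heroku-buildpack-google-chrome --app selenium-heroku
heroku buildpacks:add https://github.com/heroku/heroku-buildpack-chromedriver --app selenium-heroku

# Verify the resulting order
heroku buildpacks --app selenium-heroku
```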

Step 4: Configure Environment Variables

Add the following to your Heroku Config Vars:

CHROMEDRIVER_PATH = /app/.chromedriver/bin/chromedriver
GOOGLE_CHROME_BIN = /app/.apt/usr/bin/google-chrome
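Application code should read these variables defensively so the same module also runs outside Heroku; a small sketch (the local fallback paths are illustrative assumptions, adjust them to your machine):

```python
import os

def resolve_chrome_paths():
    """Return (chrome_binary, chromedriver) paths, preferring Heroku's
    config vars and falling back to common local defaults."""
    chrome_bin = os.environ.get("GOOGLE_CHROME_BIN", "/usr/bin/google-chrome")
    driver_path = os.environ.get("CHROMEDRIVER_PATH", "/usr/local/bin/chromedriver")
    return chrome_bin, driver_path
```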

Basic Example

A Simple Scraper App

Create selenium_utils.py:

import os
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException

class SeleniumHelper:
    """Selenium helper class."""

    def __init__(self, headless=True, timeout=30):
        self.timeout = timeout
        self.driver = self._create_driver(headless)

    def _create_driver(self, headless=True):
        """Create a WebDriver instance."""
        chrome_options = Options()

        # Basic configuration
        chrome_options.binary_location = os.environ.get("GOOGLE_CHROME_BIN")
        if headless:
            chrome_options.add_argument("--headless")
        chrome_options.add_argument("--disable-dev-shm-usage")
        chrome_options.add_argument("--no-sandbox")

        # Performance tweaks
        chrome_options.add_argument("--disable-gpu")
        chrome_options.add_argument("--disable-extensions")
        chrome_options.add_argument("--disable-infobars")
        chrome_options.add_argument("--disable-logging")
        chrome_options.add_argument("--silent")

        # Viewport
        chrome_options.add_argument("--window-size=1920,1080")

        # User agent
        chrome_options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")

        # Disable image loading (faster page loads)
        prefs = {
            "profile.managed_default_content_settings.images": 2,
            "profile.default_content_setting_values.notifications": 2
        }
        chrome_options.add_experimental_option("prefs", prefs)

        # Create the driver. Selenium 4 removed the executable_path
        # argument; the driver path goes through a Service object instead.
        service = Service(os.environ.get("CHROMEDRIVER_PATH"))
        driver = webdriver.Chrome(service=service, options=chrome_options)

        driver.set_page_load_timeout(self.timeout)
        driver.implicitly_wait(10)

        return driver

    def get_page(self, url, retries=3):
        """Load a page, retrying on timeout."""
        for attempt in range(retries):
            try:
                self.driver.get(url)
                WebDriverWait(self.driver, self.timeout).until(
                    EC.presence_of_element_located((By.TAG_NAME, "body"))
                )
                return True
            except TimeoutException:
                if attempt == retries - 1:
                    raise
                time.sleep(2)
        return False

    def find_element_safe(self, by, value):
        """Find a single element, returning None instead of raising."""
        try:
            element = WebDriverWait(self.driver, 10).until(
                EC.presence_of_element_located((by, value))
            )
            return element
        except (TimeoutException, NoSuchElementException):
            return None

    def find_elements_safe(self, by, value):
        """Find multiple elements, returning [] instead of raising."""
        try:
            elements = WebDriverWait(self.driver, 10).until(
                EC.presence_of_all_elements_located((by, value))
            )
            return elements
        except (TimeoutException, NoSuchElementException):
            return []

    def get_page_source(self):
        """Return the page source."""
        return self.driver.page_source

    def get_title(self):
        """Return the page title."""
        return self.driver.title

    def take_screenshot(self, filename="screenshot.png"):
        """Save a screenshot."""
        self.driver.save_screenshot(filename)
        return filename

    def close(self):
        """Quit the browser."""
        if self.driver:
            self.driver.quit()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.close()

Flask Application Entry Point

Create app.py:

import os
from flask import Flask, request, jsonify, render_template_string
from selenium_utils import SeleniumHelper
import traceback

app = Flask(__name__)

HTML_TEMPLATE = """
<!DOCTYPE html>
<html>
<head>
    <title>Selenium Heroku Demo</title>
    <style>
        body { font-family: Arial; max-width: 800px; margin: 50px auto; padding: 20px; }
        h1 { color: #333; }
        form { background: #f5f5f5; padding: 20px; border-radius: 5px; }
        input[type="url"] { width: 100%; padding: 10px; margin: 10px 0; }
        button { background: #007bff; color: white; padding: 10px 20px; border: none; cursor: pointer; }
        button:hover { background: #0056b3; }
        .result { margin-top: 20px; padding: 15px; background: #e9ecef; border-radius: 5px; }
        .error { background: #f8d7da; color: #721c24; }
        .success { background: #d4edda; color: #155724; }
    </style>
</head>
<body>
    <h1>Selenium Heroku Demo</h1>
    <form method="POST">
        <label>Enter the URL to scrape:</label>
        <input type="url" name="url" placeholder="https://example.com" required>
        <button type="submit">Start Scraping</button>
    </form>

    {% if result or error %}
    <div class="result {{ 'error' if error else 'success' }}">
        <h3>Result:</h3>
        <pre>{{ result or error }}</pre>
    </div>
    {% endif %}
</body>
</html>
"""

@app.route('/', methods=['GET', 'POST'])
def index():
    result = None
    error = None

    if request.method == 'POST':
        url = request.form.get('url')

        if not url:
            error = "Please provide a valid URL"
        else:
            try:
                with SeleniumHelper(headless=True) as helper:
                    helper.get_page(url)
                    result = f"Page title: {helper.get_title()}\nURL: {url}"
            except Exception as e:
                error = f"Error: {str(e)}\n{traceback.format_exc()}"

    return render_template_string(HTML_TEMPLATE, result=result, error=error)

@app.route('/scrape', methods=['POST'])
def scrape():
    """JSON API endpoint for scraping."""
    data = request.get_json()
    url = data.get('url')

    if not url:
        return jsonify({'error': 'URL is required'}), 400

    try:
        with SeleniumHelper(headless=True) as helper:
            helper.get_page(url)
            result = {
                'title': helper.get_title(),
                'url': url,
                'success': True
            }
            return jsonify(result)
    except Exception as e:
        return jsonify({
            'error': str(e),
            'success': False
        }), 500

@app.route('/health')
def health():
    """Health check."""
    return jsonify({'status': 'healthy', 'message': 'Selenium service is running'})

if __name__ == '__main__':
    port = int(os.environ.get('PORT', 5000))
    app.run(host='0.0.0.0', port=port, debug=False)

Procfile

web: gunicorn app:app

runtime.txt

python-3.11.4

Advanced Configuration

Using undetected-chromedriver

To evade bot-detection systems:

import undetected_chromedriver as uc

def create_stealth_driver():
    """Create a driver that is harder to detect."""
    options = uc.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")

    driver = uc.Chrome(options=options)
    return driver

Proxy Configuration

def create_driver_with_proxy(proxy_host, proxy_port, proxy_user=None, proxy_pass=None):
    """Create a driver that routes traffic through a proxy."""
    chrome_options = Options()
    chrome_options.binary_location = os.environ.get("GOOGLE_CHROME_BIN")
    chrome_options.add_argument("--headless")

    # Proxy configuration. Note: Chrome ignores user:pass@ credentials
    # embedded in --proxy-server, so proxy_user/proxy_pass cannot be passed
    # this way; for authenticated proxies use a tool such as selenium-wire
    # or a generated proxy-auth extension, or an IP-whitelisted proxy.
    chrome_options.add_argument(f"--proxy-server=http://{proxy_host}:{proxy_port}")

    # Selenium 4 takes the driver path via Service
    service = Service(os.environ.get("CHROMEDRIVER_PATH"))
    driver = webdriver.Chrome(service=service, options=chrome_options)

    return driver

Random User Agents

from fake_useragent import UserAgent

ua = UserAgent()

def create_driver_with_random_ua():
    """Create a driver with a random user agent."""
    chrome_options = Options()
    chrome_options.binary_location = os.environ.get("GOOGLE_CHROME_BIN")
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--no-sandbox")

    # Random user agent
    user_agent = ua.random
    chrome_options.add_argument(f"--user-agent={user_agent}")

    service = Service(os.environ.get("CHROMEDRIVER_PATH"))
    driver = webdriver.Chrome(service=service, options=chrome_options)

    return driver

Web Scraping in Practice

Scraping E-commerce Product Pages

def scrape_product_info(url):
    """Scrape product information (the CSS selectors are site-specific examples)."""
    with SeleniumHelper(headless=True) as helper:
        helper.get_page(url)

        # Wait for dynamic content to load
        time.sleep(3)

        # Product name
        name_element = helper.find_element_safe(By.CSS_SELECTOR, ".product-title")
        name = name_element.text if name_element else "unknown"

        # Price
        price_element = helper.find_element_safe(By.CSS_SELECTOR, ".price")
        price = price_element.text if price_element else "unknown"

        # Rating
        rating_element = helper.find_element_safe(By.CSS_SELECTOR, ".rating")
        rating = rating_element.get_attribute("data-rating") if rating_element else "unknown"

        # Images
        img_elements = helper.find_elements_safe(By.CSS_SELECTOR, ".product-images img")
        images = [img.get_attribute("src") for img in img_elements[:5]]

        return {
            'name': name,
            'price': price,
            'rating': rating,
            'images': images,
            'url': url
        }

Scraping News Articles

def scrape_article(url):
    """Scrape a news article."""
    with SeleniumHelper(headless=True) as helper:
        helper.get_page(url)

        # Wait for the content to load
        WebDriverWait(helper.driver, 20).until(
            EC.presence_of_element_located((By.TAG_NAME, "article"))
        )

        # Title
        title_element = helper.find_element_safe(By.TAG_NAME, "h1")
        title = title_element.text if title_element else "no title"

        # Author
        author_element = helper.find_element_safe(By.CSS_SELECTOR, ".author")
        author = author_element.text if author_element else "unknown author"

        # Publication date
        date_element = helper.find_element_safe(By.CSS_SELECTOR, ".date")
        date = date_element.text if date_element else "unknown date"

        # Body text
        content_element = helper.find_element_safe(By.TAG_NAME, "article")
        content = content_element.text if content_element else ""

        # Clean up the text
        content = content.replace('\n\n\n', '\n\n')

        return {
            'title': title,
            'author': author,
            'date': date,
            'content': content,
            'url': url
        }

Batch Scraping

def batch_scrape(urls, delay=2):
    """Scrape multiple pages in one browser session."""
    results = []

    with SeleniumHelper(headless=True) as helper:
        for url in urls:
            try:
                print(f"Scraping: {url}")
                helper.get_page(url)

                title = helper.get_title()

                results.append({
                    'url': url,
                    'title': title,
                    'success': True
                })

                # Delay to avoid hammering the target site
                time.sleep(delay)

            except Exception as e:
                results.append({
                    'url': url,
                    'error': str(e),
                    'success': False
                })

    return results

Automated Testing

Unit Test Example

Create test_selenium.py:

import unittest
from selenium_utils import SeleniumHelper
from selenium.webdriver.common.by import By

class TestSelenium(unittest.TestCase):

    def setUp(self):
        self.helper = SeleniumHelper(headless=True)

    def tearDown(self):
        self.helper.close()

    def test_basic_navigation(self):
        """Test basic navigation."""
        self.helper.get_page("https://example.com")
        self.assertIn("Example", self.helper.get_title())

    def test_element_finding(self):
        """Test element lookup."""
        self.helper.get_page("https://example.com")
        element = self.helper.find_element_safe(By.TAG_NAME, "h1")
        self.assertIsNotNone(element)

    def test_form_filling(self):
        """Test form filling."""
        self.helper.get_page("https://httpbin.org/forms/post")

        # Fill in the form
        input_element = self.helper.find_element_safe(By.NAME, "custname")
        input_element.send_keys("Test User")

        submit_button = self.helper.find_element_safe(By.CSS_SELECTOR, "button[type='submit']")
        submit_button.click()

        # Verify the submission
        self.assertIn("custname", self.helper.get_page_source())

Run the Tests

# Run the tests on a Heroku one-off dyno
heroku run python -m pytest test_selenium.py -v

Performance Optimization

1. Disable Unnecessary Features

def create_optimized_driver():
    """Create a driver tuned for low resource use."""
    chrome_options = Options()
    chrome_options.binary_location = os.environ.get("GOOGLE_CHROME_BIN")
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")

    # Disable images and media capture
    prefs = {
        "profile.managed_default_content_settings.images": 2,
        "profile.default_content_setting_values.media_stream_mic": 2,
        "profile.default_content_setting_values.media_stream_camera": 2
    }
    chrome_options.add_experimental_option("prefs", prefs)

    # Disable extensions and plugins
    chrome_options.add_argument("--disable-extensions")
    chrome_options.add_argument("--disable-plugins")

    # Memory tuning (the V8 heap limit must be passed through --js-flags)
    chrome_options.add_argument("--memory-pressure-off")
    chrome_options.add_argument("--js-flags=--max-old-space-size=4096")

    service = Service(os.environ.get("CHROMEDRIVER_PATH"))
    driver = webdriver.Chrome(service=service, options=chrome_options)

    return driver

2. Use a Driver Pool

from contextlib import contextmanager
import threading

class DriverPool:
    """Per-thread pool of reusable drivers."""
    _lock = threading.Lock()
    _pools = {}

    @classmethod
    @contextmanager
    def get_driver(cls, max_size=5):
        """Borrow a driver, returning it to the pool afterwards."""
        thread_id = threading.get_ident()

        with cls._lock:
            if thread_id not in cls._pools:
                cls._pools[thread_id] = []

            pool = cls._pools[thread_id]

            if pool:
                driver = pool.pop()
            else:
                driver = create_optimized_driver()

        # Yield outside the lock so other threads are not blocked
        # while this driver is in use
        try:
            yield driver
        finally:
            if len(pool) < max_size:
                pool.append(driver)
            else:
                driver.quit()

    @classmethod
    def close_all(cls):
        """Quit every pooled driver."""
        with cls._lock:
            for pool in cls._pools.values():
                for driver in pool:
                    driver.quit()
            cls._pools.clear()

3. Caching Strategy

from functools import lru_cache
import hashlib

def cache_result(maxsize=100):
    """Decorator that caches scrape results, keyed by an MD5 hash of the URL."""
    def decorator(func):
        @lru_cache(maxsize=maxsize)
        def cached(url_hash, url):
            # The hash keeps cache keys short and uniform; the URL itself
            # is still passed through to the wrapped function
            return func(url)

        def inner(url):
            url_hash = hashlib.md5(url.encode()).hexdigest()
            return cached(url_hash, url)

        return inner
    return decorator

@cache_result(maxsize=50)
def scrape_with_cache(url):
    """Scrape with caching."""
    with SeleniumHelper() as helper:
        helper.get_page(url)
        return {
            'title': helper.get_title(),
            'url': url
        }
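The caching effect is easy to check without launching a browser: repeated calls for the same URL should run the underlying fetch only once. A standalone sketch of the same idea, with a hypothetical `fetch_title` standing in for real scraping:

```python
import hashlib
from functools import lru_cache

calls = {"count": 0}

@lru_cache(maxsize=50)
def fetch_title(url_hash, url):
    # Stand-in for a real Selenium fetch; counts how often it actually runs
    calls["count"] += 1
    return f"title-of-{url}"

def cached_scrape(url):
    url_hash = hashlib.md5(url.encode()).hexdigest()
    return fetch_title(url_hash, url)

print(cached_scrape("https://example.com"))  # runs the fetch
print(cached_scrape("https://example.com"))  # served from the cache
print(calls["count"])  # the fetch ran only once
```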

Error Handling

Retry Mechanism

import backoff
from selenium.common.exceptions import WebDriverException

@backoff.on_exception(
    backoff.expo,
    WebDriverException,
    max_tries=3,
    max_time=60
)
def scrape_with_retry(url):
    """Scrape with automatic retries."""
    with SeleniumHelper() as helper:
        helper.get_page(url)
        return helper.get_title()
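If you would rather avoid the extra `backoff` dependency, the same retry-with-exponential-backoff pattern can be hand-rolled in a few lines; a minimal sketch (the injectable `sleep` parameter is an assumption added here so the waits can be skipped in tests):

```python
import time

def retry(func, max_tries=3, base_delay=1.0, exceptions=(Exception,), sleep=time.sleep):
    """Call func(), retrying with exponential backoff on the given exceptions."""
    for attempt in range(max_tries):
        try:
            return func()
        except exceptions:
            if attempt == max_tries - 1:
                raise  # out of attempts: propagate the last error
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Demo: a function that fails twice before succeeding
attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise ValueError("transient failure")
    return "ok"

result = retry(flaky, max_tries=3, exceptions=(ValueError,), sleep=lambda s: None)
print(result)  # "ok" after two failed attempts
```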

Graceful Degradation

def scrape_with_fallback(urls):
    """Scrape with graceful degradation."""
    results = []

    # Try Selenium first
    try:
        with SeleniumHelper(headless=True) as helper:
            for url in urls:
                try:
                    helper.get_page(url)
                    results.append({
                        'url': url,
                        'title': helper.get_title(),
                        'method': 'selenium'
                    })
                except Exception as e:
                    # Record the failure; a requests-based fallback could go here
                    results.append({
                        'url': url,
                        'error': str(e),
                        'method': 'fallback'
                    })
    except Exception as e:
        # Complete failure (e.g. the driver could not start)
        return [{
            'url': url,
            'error': str(e),
            'method': 'failed'
        } for url in urls]

    return results

Error Monitoring

import logging

# Configure logging. Note: Heroku's filesystem is ephemeral, so the log
# file is lost on restart; the stream handler feeds `heroku logs`.
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('selenium.log'),
        logging.StreamHandler()
    ]
)

logger = logging.getLogger(__name__)

def log_error(url, error):
    """Record an error."""
    logger.error(f"Scrape failed - URL: {url}, error: {error}")

    # Could also be forwarded to a monitoring system
    # send_to_monitoring(error, url)

def scrape_with_logging(url):
    """Scrape with logging."""
    try:
        logger.info(f"Starting scrape: {url}")

        with SeleniumHelper() as helper:
            helper.get_page(url)
            title = helper.get_title()

        logger.info(f"Scrape succeeded: {url}, title: {title}")
        return title

    except Exception as e:
        error_msg = f"Scrape failed: {str(e)}"
        logger.error(error_msg)
        log_error(url, str(e))
        raise

Anti-Bot Countermeasures

1. Rotating User Agents

import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
]

def create_driver_with_random_ua():
    """Create a driver with a randomly chosen user agent."""
    chrome_options = Options()
    chrome_options.binary_location = os.environ.get("GOOGLE_CHROME_BIN")
    chrome_options.add_argument("--headless")

    user_agent = random.choice(USER_AGENTS)
    chrome_options.add_argument(f"--user-agent={user_agent}")

    service = Service(os.environ.get("CHROMEDRIVER_PATH"))
    driver = webdriver.Chrome(service=service, options=chrome_options)

    return driver
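Plain `random.choice` can pick the same agent several times in a row; shuffling a queue guarantees that, within each pass, every agent is used exactly once. A small sketch of such a rotator:

```python
import random

class UserAgentRotator:
    """Cycle through user agents in shuffled order, reshuffling after each pass."""

    def __init__(self, agents):
        self.agents = list(agents)
        self.queue = []

    def next(self):
        # Refill and reshuffle once the current pass is exhausted
        if not self.queue:
            self.queue = self.agents[:]
            random.shuffle(self.queue)
        return self.queue.pop()

rotator = UserAgentRotator([
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
])
first_pass = {rotator.next() for _ in range(3)}
print(len(first_pass))  # each agent used exactly once per pass
```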

2. Random Delays

import random
import time

def random_delay(min_seconds=1, max_seconds=5):
    """Sleep for a random interval."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)

def scrape_with_delay(urls):
    """Scrape with delays between requests."""
    results = []

    with SeleniumHelper() as helper:
        for i, url in enumerate(urls):
            # Random delay between pages
            if i > 0:
                random_delay(1, 3)

            try:
                helper.get_page(url)
                results.append({'url': url, 'success': True})
            except Exception as e:
                results.append({'url': url, 'error': str(e)})

    return results

3. Simulating Human Behavior

from selenium.webdriver.common.action_chains import ActionChains

def human_like_scroll(driver):
    """Scroll the way a human might."""
    actions = ActionChains(driver)

    # Scroll in random increments with random pauses
    for _ in range(random.randint(2, 5)):
        scroll_height = random.randint(200, 800)
        actions.scroll_by_amount(0, scroll_height).perform()
        time.sleep(random.uniform(0.5, 2))

    # A blind click can navigate away; only keep this if the page tolerates it
    actions.click().perform()

def scrape_with_human_behavior(url):
    """Scrape while simulating human behavior."""
    with SeleniumHelper() as helper:
        helper.get_page(url)

        # Simulated scrolling
        human_like_scroll(helper.driver)

        # Random mouse movement
        actions = ActionChains(helper.driver)
        actions.move_by_offset(random.randint(0, 100), random.randint(0, 100)).perform()

        return helper.get_title()
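Splitting the randomized plan from its execution makes this behavior testable without a browser; a sketch of a plan generator using the same ranges as the scrolling loop above:

```python
import random

def make_scroll_plan(min_steps=2, max_steps=5):
    """Generate (scroll_px, pause_s) pairs mimicking human scrolling."""
    steps = random.randint(min_steps, max_steps)
    return [
        (random.randint(200, 800), random.uniform(0.5, 2.0))
        for _ in range(steps)
    ]

plan = make_scroll_plan()
for px, pause in plan:
    print(f"scroll {px}px, then pause {pause:.1f}s")
```

The driver-facing code then just iterates over the plan, while unit tests can assert on the plan itself.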

Best Practices

1. Code Organization

selenium_project/
├── app/
│   ├── __init__.py
│   ├── utils/
│   │   ├── selenium_helper.py
│   │   ├── proxy_manager.py
│   │   └── cache.py
│   ├── scrapers/
│   │   ├── product_scraper.py
│   │   ├── news_scraper.py
│   │   └── base_scraper.py
│   └── api/
│       └── endpoints.py
├── tests/
│   ├── test_scraper.py
│   └── test_selenium.py
├── requirements.txt
└── app.py

2. Configuration Management

# config.py
import os

class Config:
    """Base configuration."""
    TIMEOUT = int(os.environ.get('TIMEOUT', 30))
    MAX_RETRIES = int(os.environ.get('MAX_RETRIES', 3))
    HEADLESS = os.environ.get('HEADLESS', 'True').lower() == 'true'
    USE_PROXY = os.environ.get('USE_PROXY', 'False').lower() == 'true'

class ProductionConfig(Config):
    """Production settings."""
    HEADLESS = True
    TIMEOUT = 30

class DevelopmentConfig(Config):
    """Development settings."""
    HEADLESS = False
    TIMEOUT = 60

config = {
    'development': DevelopmentConfig,
    'production': ProductionConfig,
    'default': ProductionConfig
}

3. Testing Strategy

# Check that the driver can start at all
def test_driver():
    """Smoke-test the driver."""
    try:
        driver = create_optimized_driver()
        driver.get("https://example.com")
        title = driver.title
        driver.quit()
        return True, title
    except Exception as e:
        return False, str(e)

# Run the check at application startup
if __name__ == '__main__':
    success, result = test_driver()
    if success:
        print(f"Driver test passed: {result}")
    else:
        print(f"Driver test failed: {result}")

4. Resource Management

from contextlib import contextmanager

@contextmanager
def managed_driver():
    """Context-managed driver."""
    driver = None
    try:
        driver = create_optimized_driver()
        yield driver
    finally:
        if driver:
            driver.quit()

# Usage
with managed_driver() as driver:
    driver.get("https://example.com")
    print(driver.title)

Common Issues

1. Out of Memory

Problem: the Heroku dyno runs out of memory and the app crashes.

Solution

# Reduce memory use (the V8 heap limit goes through --js-flags)
chrome_options.add_argument("--memory-pressure-off")
chrome_options.add_argument("--js-flags=--max-old-space-size=1024")

# Limit concurrency
MAX_CONCURRENT = int(os.environ.get('MAX_CONCURRENT', 2))
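A `MAX_CONCURRENT` setting only helps if something enforces it; a `threading.Semaphore` is a simple way to cap how many drivers run at once. A sketch using a short sleep as a stand-in for real driver work:

```python
import threading
import time

MAX_CONCURRENT = 2
semaphore = threading.Semaphore(MAX_CONCURRENT)
lock = threading.Lock()
active = 0
peak = 0

def limited_task():
    global active, peak
    with semaphore:              # blocks while MAX_CONCURRENT tasks are running
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.05)         # stand-in for driver work
        with lock:
            active -= 1

threads = [threading.Thread(target=limited_task) for _ in range(6)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"peak concurrency: {peak}")  # never exceeds MAX_CONCURRENT
```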

2. Timeouts

Problem: pages time out while loading.

Solution

# Increase the page-load timeout
driver.set_page_load_timeout(60)

# Use explicit waits
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

WebDriverWait(driver, 60).until(
    EC.presence_of_element_located((By.TAG_NAME, "body"))
)

3. Elements Not Found

Problem: dynamically loaded elements cannot be located.

Solution

# Wait for the element to appear
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait until the element is visible
element = WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located((By.ID, "element-id"))
)

# Wait until the element is clickable
element = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.ID, "element-id"))
)

4. Bot Detection

Problem: the target site flags you as a bot.

Solution

# Use undetected-chromedriver
import undetected_chromedriver as uc

driver = uc.Chrome()

# Randomize parameters
options.add_argument(f"--window-size={random.randint(800, 1920)},{random.randint(600, 1080)}")

5. Deployment Failures

Problem: buildpack or configuration errors.

Solution

# Check the buildpack order
heroku buildpacks

# Stream the logs (build output also appears here during deploys)
heroku logs --tail

# Check environment variables
heroku config

Summary

This guide covered an end-to-end approach to web scraping with Selenium on the Heroku platform:

  • Environment setup: buildpacks, Chrome, and ChromeDriver configuration
  • Basic usage: the SeleniumHelper class and core operations
  • Advanced features: proxies, user agents, detection evasion
  • Practical examples: e-commerce, news, and batch scraping
  • Performance optimization: caching, driver pooling, disabling features
  • Error handling: retries, graceful degradation, logging and monitoring
  • Anti-bot strategies: randomization and simulated human behavior

With these techniques you can:

  • Build robust web-scraping systems
  • Work around simple anti-bot mechanisms
  • Handle large-scale scraping jobs
  • Implement automated testing and monitoring
  • Optimize performance and resource usage

Keep practicing these techniques and you will become a web-scraping expert!