If you simply call a Scrapy spider from inside Flask, you may run into errors such as:
ValueError: signal only works in main thread
# or
twisted.internet.error.ReactorNotRestartable
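Both errors come from the Twisted reactor that Scrapy runs on: it installs signal handlers, which Python only allows in the main thread, and once stopped it cannot be started again in the same process. There are several ways around this.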
1 Use a Python subprocess (subprocess)
First, make sure the directory structure looks something like this:
> tree -L 1
├── dirbot
├── README.rst
├── scrapy.cfg
├── server.py
└── setup.py
Then start the spider in a new process:
# server.py
import subprocess

from flask import Flask

app = Flask(__name__)


@app.route('/')
def hello_world():
    """
    Run spider in another process and store items in file. Simply issue command:

    > scrapy crawl dmoz -o "output.json"

    wait for this command to finish, and read output.json to client.
    """
    spider_name = "dmoz"
    # Note: -o appends to an existing output.json, so remove the file between
    # runs if each response should contain only the latest crawl.
    subprocess.check_output(['scrapy', 'crawl', spider_name, "-o", "output.json"])
    with open("output.json") as items_file:
        return items_file.read()


if __name__ == '__main__':
    app.run(debug=True)
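The handler above writes every crawl to the same output.json, so repeated or concurrent requests can interfere with each other. A minimal sketch of one way around that, assuming a fresh temporary file per request and a timeout on the child process, might look like this. The helper names `build_crawl_command`, `fake_crawl_command`, and `run_and_collect`, as well as the 60-second default, are hypothetical and not part of the original answer.

```python
import json
import os
import subprocess
import sys
import tempfile


def build_crawl_command(spider_name, output_path):
    """Return the argv list for `scrapy crawl <spider_name> -o <output_path>`."""
    return ["scrapy", "crawl", spider_name, "-o", output_path]


def fake_crawl_command(spider_name, output_path):
    """Hypothetical stand-in that writes one item, for trying this without Scrapy."""
    script = (
        "import json, sys; "
        "json.dump([{'spider': sys.argv[1]}], open(sys.argv[2], 'w'))"
    )
    return [sys.executable, "-c", script, spider_name, output_path]


def run_and_collect(spider_name, command_builder=build_crawl_command, timeout=60):
    """Run the crawl command in a child process and return the parsed items."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        # The .json extension lets Scrapy pick the JSON feed format.
        output_path = os.path.join(tmp_dir, "items.json")
        # check=True raises CalledProcessError on a non-zero exit status;
        # timeout raises TimeoutExpired if the crawl runs too long.
        subprocess.run(
            command_builder(spider_name, output_path),
            check=True,
            timeout=timeout,
        )
        # A crawl that yields no items may leave the file missing or empty.
        if not os.path.exists(output_path) or os.path.getsize(output_path) == 0:
            return []
        with open(output_path, encoding="utf-8") as items_file:
            return json.load(items_file)
```

In the Flask view, `return json.dumps(run_and_collect("dmoz"))` could then take the place of the fixed output.json file. The process still has to be started from the project directory so that Scrapy can find scrapy.cfg.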
2 Use Twisted-Klein + Scrapy
The code is as follows:
# server.py
import json

from klein import route, run
from scrapy import signals
from scrapy.crawler import CrawlerRunner

from dirbot.spiders.dmoz import DmozSpider


class MyCrawlerRunner(CrawlerRunner):
    """
    Crawler object that collects items and returns output after finishing crawl.
    """

    def crawl(self, crawler_or_spidercls, *args, **kwargs):
        # keep all items scraped
        self.items = []

        # create crawler (same as in base CrawlerRunner)
        crawler = self.create_crawler(crawler_or_spidercls)

        # handle each item scraped
        crawler.signals.connect(self.item_scraped, signals.item_scraped)

        # create Twisted Deferred launching crawl
        dfd = self._crawl(crawler, *args, **kwargs)

        # add callback - when crawl is done call return_items
        dfd.addCallback(self.return_items)
        return dfd

    def item_scraped(self, item, response, spider):
        self.items.append(item)

    def return_items(self, result):
        return self.items


def return_spider_output(output):
    """
    :param output: items scraped by CrawlerRunner
    :return: json with list of items
    """
    # this just turns items into dictionaries
    # you may want to use Scrapy JSON serializer here
    return json.dumps([dict(item) for item in output])


@route("/")
def schedule(request):
    runner = MyCrawlerRunner()
    # pass the spider class, not an instance: recent Scrapy versions
    # reject spider objects in create_crawler()
    deferred = runner.crawl(DmozSpider)
    deferred.addCallback(return_spider_output)
    return deferred


run("localhost", 8080)
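`return_spider_output` relies on plain `json.dumps`, which raises `TypeError` for values such as dates or decimals that scraped items often carry. As a rough sketch using only the standard library, the `default` hook of `json.dumps` can convert those values. The function names `to_jsonable` and `items_to_json` are hypothetical; Scrapy also ships its own encoder, `ScrapyJSONEncoder` in `scrapy.utils.serialize`, which is probably what the code comment has in mind.

```python
import datetime
import decimal
import json


def to_jsonable(value):
    """Fallback for json.dumps: convert values it cannot encode by itself."""
    if isinstance(value, (datetime.datetime, datetime.date, datetime.time)):
        return value.isoformat()
    if isinstance(value, decimal.Decimal):
        return str(value)
    # Anything else is still an error, so unexpected types are not hidden.
    raise TypeError(f"Cannot serialize {type(value).__name__}")


def items_to_json(items):
    """Turn an iterable of dict-like items into a JSON array string."""
    return json.dumps([dict(item) for item in items], default=to_jsonable)
```

With this in place, `deferred.addCallback(items_to_json)` could be used instead of `return_spider_output`.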
3 Use ScrapyRT
Install ScrapyRT, then start it from the Scrapy project directory:
> scrapyrt
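Once running, ScrapyRT exposes an HTTP API. By default it listens on port 9080 and serves `crawl.json`, which takes `spider_name` and `url` query parameters and returns a JSON object whose `items` field holds the scraped items. The sketch below shows one way a client (for example a Flask view) might call it with only the standard library. The helper names `build_crawl_url` and `fetch_items`, and the 60-second timeout, are assumptions for illustration.

```python
import json
import urllib.parse
import urllib.request


def build_crawl_url(spider_name, start_url, host="localhost", port=9080):
    """Build the crawl.json URL for a ScrapyRT server (9080 is its default port)."""
    query = urllib.parse.urlencode({"spider_name": spider_name, "url": start_url})
    return f"http://{host}:{port}/crawl.json?{query}"


def fetch_items(spider_name, start_url, timeout=60):
    """Ask a running ScrapyRT server to crawl start_url and return its items."""
    request_url = build_crawl_url(spider_name, start_url)
    with urllib.request.urlopen(request_url, timeout=timeout) as response:
        payload = json.load(response)
    # Fall back to an empty list if the response carries no "items" field.
    return payload.get("items", [])
```

Because the crawl runs inside the ScrapyRT process, the web application never touches the Twisted reactor itself.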
Source: https://stackoverflow.com/questions/36384286/how-to-integrate-flask-scrapy