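This article walks through an example of a custom Scrapy spider middleware, written in Python, that avoids re-crawling item pages that have already been visited. It is shared here for reference. The details are as follows: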
# Note: scrapy.log and BaseItem come from older Scrapy releases; newer
# releases use Python's logging module and scrapy.Item instead.
from scrapy import log
from scrapy.http import Request
from scrapy.item import BaseItem
from scrapy.utils.request import request_fingerprint

from myproject.items import MyItem


class IgnoreVisitedItems(object):
    """Middleware to ignore re-visiting item pages if they were already
    visited before. The requests to be filtered have a
    meta['filter_visited'] flag enabled and optionally define an id to use
    for identifying them, which defaults to the request fingerprint,
    although you'd want to use the item id, if you already have it
    beforehand, to make it more robust.
    """

    FILTER_VISITED = 'filter_visited'
    VISITED_ID = 'visited_id'
    CONTEXT_KEY = 'visited_ids'

    def process_spider_output(self, response, result, spider):
        # The spider is expected to expose a `context` dict. Without one, a
        # fresh dict is created on every call and nothing is remembered.
        context = getattr(spider, 'context', {})
        visited_ids = context.setdefault(self.CONTEXT_KEY, {})
        ret = []
        for x in result:
            visited = False
            if isinstance(x, Request):
                # Only requests that opt in through the meta flag are checked.
                if self.FILTER_VISITED in x.meta:
                    visit_id = self._visited_id(x)
                    if visit_id in visited_ids:
                        log.msg("Ignoring already visited: %s" % x.url,
                                level=log.INFO, spider=spider)
                        visited = True
            elif isinstance(x, BaseItem):
                # An item was scraped: record the id of the request behind it.
                visit_id = self._visited_id(response.request)
                if visit_id:
                    visited_ids[visit_id] = True
                    x['visit_id'] = visit_id
                    x['visit_status'] = 'new'
            if visited:
                # Replace the filtered request with a placeholder item.
                ret.append(MyItem(visit_id=visit_id, visit_status='old'))
            else:
                ret.append(x)
        return ret

    def _visited_id(self, request):
        # Prefer the explicit id from meta; fall back to the fingerprint.
        return request.meta.get(self.VISITED_ID) or request_fingerprint(request)
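The filtering rule used by the middleware can be tried out without Scrapy. The following is only a minimal sketch of that rule, not the middleware itself: `FakeRequest`, `filter_visited`, and `mark_visited` are hypothetical names introduced for illustration, the example URLs and ids are made up, and the URL stands in for the request fingerprint as the fallback id.

```python
from dataclasses import dataclass, field


@dataclass
class FakeRequest:
    """Hypothetical stand-in for scrapy.http.Request (url and meta only)."""
    url: str
    meta: dict = field(default_factory=dict)


def visited_id(request):
    # Prefer an explicit id from meta. The URL is used as the fallback here,
    # where the middleware would use the request fingerprint (assumption).
    return request.meta.get('visited_id') or request.url


def mark_visited(request, visited_ids):
    """Record the id of a request whose page produced an item."""
    visited_ids.add(visited_id(request))


def filter_visited(requests, visited_ids):
    """Split requests into (to_crawl, skipped) using the visited ids."""
    to_crawl, skipped = [], []
    for req in requests:
        # Only requests that carry the opt-in flag are ever filtered.
        if 'filter_visited' in req.meta and visited_id(req) in visited_ids:
            skipped.append(req)
        else:
            to_crawl.append(req)
    return to_crawl, skipped


visited = set()

# Pretend the page for item-1 was already scraped in an earlier run.
mark_visited(
    FakeRequest('http://example.com/item/1',
                {'filter_visited': True, 'visited_id': 'item-1'}),
    visited,
)

batch = [
    # Same item reached through a different URL: skipped thanks to the id.
    FakeRequest('http://example.com/item/1?ref=list',
                {'filter_visited': True, 'visited_id': 'item-1'}),
    # New item: crawled.
    FakeRequest('http://example.com/item/2',
                {'filter_visited': True, 'visited_id': 'item-2'}),
    # No opt-in flag: never filtered, even though the page was seen.
    FakeRequest('http://example.com/item/1'),
]

to_crawl, skipped = filter_visited(batch, visited)
print('crawl:', [r.url for r in to_crawl])
print('skip: ', [r.url for r in skipped])
```

The first request in the batch shows why an explicit item id is more robust than a fingerprint: the URL differs, but the id still matches. In a real project the middleware would additionally have to be registered in the `SPIDER_MIDDLEWARES` setting, and the spider would need to keep a `context` dict alive for the visited ids to persist.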
Hopefully this article is helpful for your Python programming.