Scrapy Notes

 2018/04/10   Share

0x00 Basics

scrapy startproject <project_name> [project_dir]

A Scrapy spider that crawls the first two pages of quotes and saves each one as an HTML file

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # The URL ends in ".../page/<n>/", so the second-to-last piece is the page number.
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

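Run this spider with scrapy crawl quotes.
name is required, and no two spiders in a project may share the same name.
start_requests() returns the initial requests that the spider starts crawling from; later requests are generated from these initial ones.
parse() is the method called to handle the response downloaded for each request. The response argument is an instance of TextResponse, which holds the page content.
page = response.url.split("/") splits the URL on /, which returns a list. The [-2] then indexes that list and picks the second-to-last item.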
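The split-and-index step can be tried outside Scrapy with a plain string. This is a minimal sketch; the helper name page_from_url is made up for illustration and is not part of Scrapy.

```python
def page_from_url(url):
    """Return the second-to-last piece of a URL split on '/'.

    Hypothetical helper: it mirrors response.url.split("/")[-2] from the spider.
    """
    parts = url.split("/")
    return parts[-2]


url = 'http://quotes.toscrape.com/page/1/'

# The trailing slash produces an empty string as the last item,
# which is why the page number sits at index -2 rather than -1.
parts = url.split("/")
page = page_from_url(url)
filename = 'quotes-%s.html' % page

print(parts)
print(page, filename)
```

Note that this relies on the URL ending with a slash; for a URL without the trailing slash, the page number would be at index -1 instead.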
The same spider without start_requests(), using the start_urls attribute instead:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',]
    def parse(self,response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename,'wb') as f:
            f.write(response.body)

0x01 Practice extracting data

scrapy shell http://quotes.toscrape.com/page/1/

response.css('title').extract()
['<title>Quotes to Scrape</title>']

response.css('title::text').extract_first()
'Quotes to Scrape'

response.css('title::text')[0].extract()
'Quotes to Scrape'

response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
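The .re() call above applies a regular expression to the selected text. A rough standard-library analogue, assuming the title text has already been extracted into a plain string (the variable title_text is an assumption for illustration), might look like this. It only approximates Scrapy's behaviour for patterns without capture groups.

```python
import re

# Assumed input: the text that response.css('title::text') selects on the page.
title_text = 'Quotes to Scrape'

# No capture groups: findall returns each full match as a string.
whole_match = re.findall(r'Quotes.*', title_text)

# A narrower pattern matches only the first word.
first_word = re.findall(r'Q\w+', title_text)

print(whole_match)
print(first_word)
```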

0x03 Saving data

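scrapy crawl quotes -o quotes.jl generates a file named quotes.jl that contains all scraped items, serialized as JSON Lines (one JSON object per line). To get a single JSON array instead, use scrapy crawl quotes -o quotes.json.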
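To make the JSON Lines layout concrete, here is a small standard-library sketch that writes and reads a file in that shape. The helper names and the sample items are made up for illustration; this is not how Scrapy's exporter is implemented, only what its .jl output looks like.

```python
import json
import os
import tempfile


def write_json_lines(items, path):
    """Write one JSON object per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")


def read_json_lines(path):
    """Parse each non-empty line as its own JSON document (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Placeholder items shaped like the dicts a spider might yield.
items = [
    {"text": "first placeholder quote", "author": "Author A", "tags": ["x", "y"]},
    {"text": "second placeholder quote", "author": "Author B", "tags": []},
]

path = os.path.join(tempfile.mkdtemp(), "quotes.jl")
write_json_lines(items, path)
loaded = read_json_lines(path)

with open(path, encoding="utf-8") as f:
    line_count = len(f.read().splitlines())
```

Because every line is an independent JSON document, new records can be appended to a .jl file without rewriting it, which is not true of a single JSON array.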

0x04 Following pagination

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

A note on the urljoin() function: there is an easy-to-follow explanation at https://www.cnblogs.com/phil-chow/p/5347947.html
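As far as I understand it, response.urljoin() resolves a link against the page's URL much like the standard library's urllib.parse.urljoin does. The sketch below uses the standard-library function directly, with the page URL as an assumed base, to show how different kinds of href values resolve.

```python
from urllib.parse import urljoin

# Assumed base: the URL of the page the link was found on.
base = 'http://quotes.toscrape.com/page/1/'

# A root-relative href (like the "next" link) replaces the whole path.
next_page = urljoin(base, '/page/2/')

# A relative href is resolved against the base path.
relative = urljoin(base, 'tag/love/')

# "../" climbs one directory before appending.
parent = urljoin(base, '../2/')

# An absolute URL is returned unchanged.
absolute = urljoin(base, 'http://example.com/other/')

for result in (next_page, relative, parent, absolute):
    print(result)
```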

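Here, the two lines next_page = response.urljoin(next_page) and yield scrapy.Request(next_page, callback=self.parse) can be replaced with the single line yield response.follow(next_page, callback=self.parse), because response.follow() accepts relative URLs directly.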

Original author: Hu3sky

Original link: http://yoursite.com/2018/04/10/scrapy笔记/

Published: April 10th 2018, 12:00:00 am


