
Scrapy Notes

2018/04/10

0x00 Basics

scrapy startproject <project_name> [project_dir]
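
For reference, running scrapy startproject quotes_demo (quotes_demo is a hypothetical project name) creates the standard Scrapy layout, roughly:

quotes_demo/
    scrapy.cfg            # deploy configuration file
    quotes_demo/          # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider/downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory where your spiders live
            __init__.py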

A Scrapy spider that crawls the first two pages of quotes and saves each page as an HTML file:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            # Schedule each URL and hand the downloaded response to parse()
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # e.g. 'http://quotes.toscrape.com/page/1/' -> '1'
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

  • Run this spider with scrapy crawl quotes.

  • name is required, and each spider's name must be unique.

  • start_requests() returns the requests the spider will start crawling from; subsequent requests are generated successively from these initial ones.
  • parse() is the method called to handle the response downloaded for each request. The response argument is an instance of TextResponse that holds the page content.
  • page = response.url.split("/") splits the URL on /, returning a list. The [-2] indexes into that list and selects the second-to-last item (see the sketch below).
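
A concrete illustration of that indexing (a minimal, self-contained sketch):

url = 'http://quotes.toscrape.com/page/1/'
parts = url.split("/")
# parts == ['http:', '', 'quotes.toscrape.com', 'page', '1', '']
print(parts[-2])  # prints '1', the page number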
Without start_requests(), the same spider can be written as:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
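
This shorter version works because the base scrapy.Spider provides a default start_requests() that iterates over start_urls and routes every response to parse(). It behaves roughly like the following (a simplified sketch of the built-in behavior, not Scrapy's exact source):

def start_requests(self):
    # Default: one request per URL in start_urls; the callback
    # defaults to the spider's parse() method
    for url in self.start_urls:
        yield scrapy.Request(url, dont_filter=True)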

0x01 Data Extraction Practice

scrapy shell http://quotes.toscrape.com/page/1/

response.css('title').extract()
['<title>Quotes to Scrape</title>']

response.css('title::text').extract_first()
'Quotes to Scrape'

response.css('title::text')[0].extract()
'Quotes to Scrape'

response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
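
The same shell session can also pull actual quote data, using the selectors the pagination spider below relies on; a minimal sketch, with outputs shown as they appeared on the site at the time of writing:

response.css('div.quote span.text::text').extract_first()
# e.g. '“The world as we have created it is a process of our thinking. ...”'
response.css('div.quote small.author::text').extract_first()
# e.g. 'Albert Einstein'
response.css('div.quote div.tags a.tag::text').extract()
# e.g. ['change', 'deep-thoughts', 'thinking', 'world']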

0x03 Saving Data

scrapy crawl quotes -o quotes.jl generates a quotes.jl file containing all the scraped items, serialized in JSON Lines format (one JSON object per line).
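
Given the item dict yielded by the pagination spider in the next section, each line of quotes.jl is a standalone JSON object, roughly like this (illustrative values, not actual output):

{"text": "“The world as we have created it...”", "author": "Albert Einstein", "tags": ["change", "deep-thoughts", "thinking", "world"]}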

0x04 Pagination

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        # Extract each quote block on the current page
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        # Follow the "next page" link, if there is one
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

A quick note on the urljoin() function: it resolves a (possibly relative) URL against the URL of the current response. A fairly easy-to-follow explanation: https://www.cnblogs.com/phil-chow/p/5347947.html

Here, next_page = response.urljoin(next_page) followed by yield scrapy.Request(next_page, callback=self.parse) can be replaced with yield response.follow(next_page, callback=self.parse), since response.follow() supports relative URLs directly.
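
With that substitution, the tail of parse() becomes (a minimal sketch, same selectors as above):

def parse(self, response):
    # ... yield the quote items as before ...
    next_page = response.css('li.next a::attr(href)').extract_first()
    if next_page is not None:
        # response.follow() resolves relative URLs itself,
        # so the urljoin() step is no longer needed
        yield response.follow(next_page, callback=self.parse)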

CATALOG
  1. 0x00 Basics
  2. 0x01 Data Extraction Practice
  3. 0x03 Saving Data
  4. 0x04 Pagination