__________________Command Prompt____________________
scrapy shell https://www.dushu.com/book/1107.html
_______________________Scrapy Shell______________________
In [1]: from scrapy.linkextractors import LinkExtractor
In [2]: link = LinkExtractor(allow=r'/book/1107_\d+\.html')
In [4]: link.extract_links(response)
Out[4]:
[Link(url='https://www.dushu.com/book/1107_2.html', text='2', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1107_3.html', text='3', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1107_4.html', text='4', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1107_5.html', text='5', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1107_6.html', text='6', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1107_7.html', text='7', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1107_8.html', text='8', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1107_9.html', text='9', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1107_10.html', text='10', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1107_11.html', text='11', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1107_12.html', text='12', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1107_13.html', text='13', fragment='', nofollow=False)]
In [6]: link1 = LinkExtractor(restrict_xpaths=r'//div[@class="pages"]/a')
In [7]: link1.extract_links(response)
Out[7]:
[Link(url='https://www.dushu.com/book/1107_2.html', text='2', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1107_3.html', text='3', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1107_4.html', text='4', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1107_5.html', text='5', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1107_6.html', text='6', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1107_7.html', text='7', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1107_8.html', text='8', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1107_9.html', text='9', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1107_10.html', text='10', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1107_11.html', text='11', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1107_12.html', text='12', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1107_13.html', text='13', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1107_2.html', text='下一页»', fragment='', nofollow=False)]
In [8]: link.extract_links(response)
Out[8]:
[Link(url='https://www.dushu.com/book/1107_2.html', text='2', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1107_3.html', text='3', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1107_4.html', text='4', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1107_5.html', text='5', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1107_6.html', text='6', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1107_7.html', text='7', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1107_8.html', text='8', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1107_9.html', text='9', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1107_10.html', text='10', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1107_11.html', text='11', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1107_12.html', text='12', fragment='', nofollow=False),
 Link(url='https://www.dushu.com/book/1107_13.html', text='13', fragment='', nofollow=False)]
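The `allow` pattern in the session matches pages 2 through 13 but not the first page, `1107.html`, because the regex requires an underscore and digits after `1107`. The following is a rough standard-library sketch of that filtering step only. It is not Scrapy's implementation: `filter_links` is a hypothetical helper, the `hrefs` list is made-up sample data, and the sketch assumes the pattern is applied with search semantics against the absolute URL and that repeated URLs are dropped.

```python
import re
from urllib.parse import urljoin

# Same pattern as the `allow` argument used in the shell session.
ALLOW = re.compile(r'/book/1107_\d+\.html')


def filter_links(base_url, hrefs, pattern=ALLOW):
    """Resolve each href against base_url and keep the ones the pattern matches.

    Hypothetical helper: it only mimics an allow-style filter by running
    re.search on the absolute URL and skipping URLs already seen.
    """
    seen = set()
    kept = []
    for href in hrefs:
        url = urljoin(base_url, href)
        if pattern.search(url) and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept


base = 'https://www.dushu.com/book/1107.html'
# Made-up hrefs standing in for the anchors found on the page.
hrefs = ['/book/1107.html', '/book/1107_2.html', '/book/1107_13.html',
         '/book/1107_2.html', '/news/']
matched = filter_links(base, hrefs)
print(matched)
```

Running the sketch keeps only the two paginated URLs, which mirrors why the first page itself never appears in the `link.extract_links(response)` output.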