Notes on Web Crawlers


Randomly rotating the user-agent

Change the user-agent on every URL request.

pip install fake-useragent

settings

DOWNLOADER_MIDDLEWARES = {
    # 'ArticleSpider.middlewares.MyCustomDownloaderMiddleware': 543,
    'ArticleSpider.middlewares.RandomUserAgentMiddleware': 400,
}

middlewares 

from fake_useragent import UserAgent


class RandomUserAgentMiddleware(object):
    def __init__(self, crawler):
        super(RandomUserAgentMiddleware, self).__init__()

        self.ua = UserAgent()
        # Read RANDOM_UA_TYPE from settings; defaults to 'random'.
        # Valid values: random, ie, chrome, firefox, safari, opera, msie
        self.ua_type = crawler.settings.get('RANDOM_UA_TYPE', 'random')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_request(self, request, spider):
        def get_ua():
            '''Pick a User-Agent for this request according to RANDOM_UA_TYPE'''
            return getattr(self.ua, self.ua_type)

        request.headers.setdefault('User-Agent', get_ua())
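
The middleware reads a single custom setting. To pin the user-agent family, you could add something like the following to settings.py (any value other than 'random' is optional):

RANDOM_UA_TYPE = 'chrome'  # or 'random', 'firefox', 'ie', 'safari', 'opera', 'msie'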

 

IP proxies

Option 1: free proxies

Write a custom function that collects free proxy IPs from public proxy sites.

settings

DOWNLOADER_MIDDLEWARES = {
    'ArticleSpider.middlewares.RandomProxyMiddleware': 400,
}

middlewares 

class RandomProxyMiddleware(object):
    # Set a proxy dynamically for every request
    def process_request(self, request, spider):
        request.meta["proxy"] = get_random_ip()  # custom helper that returns a random "http://ip:port" proxy
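
A minimal sketch of what get_random_ip might look like, assuming the free proxies have already been collected into a list (scraping and validating them from a public proxy site is left out; the addresses below are placeholders):

import random

FREE_PROXIES = [
    "http://111.111.111.111:8080",
    "http://222.222.222.222:3128",
]

def get_random_ip():
    """Return one proxy in the form scheme://ip:port, chosen at random."""
    return random.choice(FREE_PROXIES)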

Option 2: paid proxies

e.g. scrapy-proxies and similar projects on GitHub.

Online CAPTCHA solving

 Roll-your-own recognition: CAPTCHAs are hard to recognize programmatically and change frequently, so writing your own recognizer is not recommended.

 Online solving services: call the API of an existing online CAPTCHA-recognition service. Accuracy is above 90% and throughput is high (recommended).

 Human solving: accuracy close to 100%, but costly (use for complex CAPTCHAs).

Disabling cookies

Some sites track visitors through cookies. For sites that do not require login, you can disable cookies to lower the chance of being banned; Scrapy enables cookies by default.

COOKIES_ENABLED = False

Auto-throttling

Tune a few settings, for example:

AUTOTHROTTLE_ENABLED = True
DOWNLOAD_DELAY = 3
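
A few related AutoThrottle settings are often tuned alongside these two; the values below are illustrative, not recommendations:

AUTOTHROTTLE_START_DELAY = 5            # initial download delay
AUTOTHROTTLE_MAX_DELAY = 60             # maximum delay when latency is high
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # average number of parallel requests per remote server
AUTOTHROTTLE_DEBUG = False              # log throttling stats for every received response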

selenium 

Official documentation: http://selenium-python-docs-zh.readthedocs.io/zh_CN/latest/

Purpose: drive a real browser programmatically.

Install Selenium:

pip install selenium

Download the driver for your browser:

http://selenium-python.readthedocs.io/installation.html

Logging in to Zhihu through a third party (Weibo)

import time
from selenium import webdriver
from scrapy.selector import Selector

browser = webdriver.Chrome(executable_path="D:/Package/chromedriver.exe")
time.sleep(2)  # wait for the page to finish loading

browser.get("https://www.zhihu.com/#signin")
browser.find_element_by_css_selector(".qrcode-signin-cut-button").click()
browser.find_element_by_css_selector(".signup-social-buttons").click()
browser.find_element_by_css_selector(".js-bindweibo").click()
# browser.switch_to.window(browser.window_handles[-1])
browser.find_element_by_css_selector(".WB_iptxt").send_keys("xxx")
browser.find_element_by_css_selector("input[node-type='passwd']").send_keys("xxx")
browser.find_element_by_css_selector("a[node-type='submit']").click()
time.sleep(2)  # wait for the page to finish loading
browser.find_element_by_css_selector("a[node-type='submit']").click()

Logging in to Zhihu through a third party (QQ)

# -*- coding: utf-8 -*-
__author__ = 'hy'
import time
from selenium import webdriver
from scrapy.selector import Selector

browser = webdriver.Firefox(executable_path="D:/Package/geckodriver.exe")

browser.get("https://www.zhihu.com/#signin")
time.sleep(2)

# Click through to the QQ login
browser.find_element_by_css_selector(".qrcode-signin-cut-button").click()
browser.find_element_by_css_selector(".signup-social-buttons").click()
time.sleep(2)
browser.find_element_by_css_selector(".js-bindqq").click()
time.sleep(5)

browser.switch_to.window(browser.window_handles[-1])
browser.switch_to.frame("ptlogin_iframe")  # iframes must be entered one level at a time

# Username and password

# Hide the initial panel
browser.execute_script('document.getElementById("qlogin").style="display: none;"')
browser.execute_script('document.getElementsByClassName("authLogin").style="display: none;"')
# Show the username/password login panel
browser.execute_script('document.getElementById("web_qr_login").style="display: block;"')
# browser.evaluate_script('document.getElementById("batch_quto").contentEditable = true')
time.sleep(5)

# Type in the username and password, then submit
elem_user = browser.find_element_by_name("u").send_keys("xxx")
elem_pwd = browser.find_element_by_name("p").send_keys("xxx")
elem_but = browser.find_element_by_id("login_button").click()
time.sleep(5)

Integrating Selenium with Scrapy

Why integrate Selenium?

Selenium stands in for the downloader: operations that are too hard to code by hand are delegated to Selenium.

Advantage: harder for anti-crawler defenses to block, since a real browser is driving the requests.

Drawback: Selenium runs synchronously and is therefore slow; it would need to be combined with Twisted to become asynchronous.

The middleware approach

Variant 1

settings

DOWNLOADER_MIDDLEWARES = {
    'ArticleSpider.middlewares.JSPageMiddleware': 1,
}

middlewares   

from selenium import webdriver
from scrapy.http import HtmlResponse
import time


class JSPageMiddleware(object):
    def __init__(self):  # the browser is held on self, so a single browser is shared by all spiders
        self.browser = webdriver.Chrome(executable_path="D:/Package/chromedriver.exe")
        super(JSPageMiddleware, self).__init__()

    # Fetch dynamic pages through Chrome
    def process_request(self, request, spider):
        if spider.name == "jobbole":
            # self.browser = webdriver.Chrome(executable_path="D:/Package/chromedriver.exe")
            self.browser.get(request.url)
            time.sleep(1)
            print("Visiting: {0}".format(request.url))
            # browser.quit()
            return HtmlResponse(url=self.browser.current_url, body=self.browser.page_source,
                                encoding="utf-8", request=request)

Variant 2

middlewares 

from scrapy.http import HtmlResponse
import time

class JSPageMiddleware(object):
    # Fetch dynamic pages through Chrome; the browser instance lives on the spider (see below)
    def process_request(self, request, spider):
        if spider.name == "jobbole":
            spider.browser.get(request.url)
            time.sleep(1)
            print("Visiting: {0}".format(request.url))
            return HtmlResponse(url=spider.browser.current_url, body=spider.browser.page_source,
                                encoding="utf-8", request=request)

spider

import scrapy
from selenium import webdriver
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals

class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/all-posts/']

    def __init__(self):  # one browser per spider instance
        self.browser = webdriver.Chrome(executable_path="D:/Package/chromedriver.exe")
        super(JobboleSpider, self).__init__()
        dispatcher.connect(self.spider_closed, signals.spider_closed)  # hook run when the spider closes

    def spider_closed(self, spider):
        self.browser.quit()

Using Selenium inside Scrapy for simulated login

Why not replace the native downloader with Selenium entirely?

Selenium works synchronously; if every page went through Selenium, the crawler would become extremely slow. There is currently no asynchronous scheme that combines Scrapy's Twisted with Selenium, so Selenium is not recommended as a replacement for the native downloader.

What is Selenium-in-Scrapy good for?

Simulated login is very hard to solve purely in code, so it is handled with Selenium; every other page keeps using the native downloader's asynchronous download path.

# -*- coding: utf-8 -*-
import re
import datetime

try:
    import urlparse as parse
except:
    from urllib import parse

import scrapy
from selenium import webdriver
import time

class ZhihuSpider(scrapy.Spider):
    name = "zhihu"
    allowed_domains = ["www.zhihu.com"]
    start_urls = ['https://www.zhihu.com/']
    login_cookies = []

    headers = {
        "HOST": "www.zhihu.com",
        "Referer": "https://www.zhihu.com",
        'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0"
    }

    # Log in with Selenium and save the cookies
    def get_cookies(self):
        browser = webdriver.Chrome(executable_path="D:/Package/chromedriver.exe")
        time.sleep(2)  # wait for the page to finish loading

        browser.get("https://www.zhihu.com/#signin")
        browser.find_element_by_css_selector(".qrcode-signin-cut-button").click()
        browser.find_element_by_css_selector(".signup-social-buttons").click()
        browser.find_element_by_css_selector(".js-bindweibo").click()
        # browser.switch_to.window(browser.window_handles[-1])
        browser.find_element_by_css_selector(".WB_iptxt").send_keys("xxx")
        browser.find_element_by_css_selector("input[node-type='passwd']").send_keys("xxx")
        browser.find_element_by_css_selector("a[node-type='submit']").click()
        time.sleep(2)  # wait for the page to finish loading
        browser.find_element_by_css_selector("a[node-type='submit']").click()
        self.login_cookies = browser.get_cookies()  # keep the cookies on the spider for later requests
        browser.close()

    # Step 1: runs before parse and handles the login. The cookies passed to this first
    # request are then carried automatically on every subsequent request.
    def start_requests(self):
        return [scrapy.Request('https://www.zhihu.com/#signin', headers=self.headers, cookies=self.login_cookies,
                               callback=self.parse)]

    # Step 2: logic that runs after login
    def parse(self, response):
        my_url = 'https://www.zhihu.com/people/edit'  # personal settings page, only reachable when logged in
        yield scrapy.Request(my_url, headers=self.headers)

Crawling Zhihu articles and Q&A

Debugging with scrapy shell

scrapy shell -s USER_AGENT="Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0" https://www.zhihu.com/question/56320032
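
Inside the shell you can then try selectors directly against the response, for example:

response.css('title::text').extract_first()
response.xpath('//h1/text()').extract()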

Page analysis

Install the JSONView extension in Chrome.

Look at the JSON returned by the XHR requests in the network panel; pulling data from those endpoints is much easier than parsing HTML.
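
Once an XHR endpoint is located, the response body is plain JSON and can be parsed directly in a spider callback. A sketch (the endpoint structure and field names are hypothetical and depend on what the network panel shows):

import json

def parse_answers(self, response):
    data = json.loads(response.text)
    for answer in data.get('data', []):
        yield {'answer_id': answer.get('id')}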

 

Table design

To cope with fields that may fail to parse, or rows that would otherwise fail to insert, give every column a default value.
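
The same guarantee can also be made on the Python side before the INSERT; a sketch with a plain dict and hypothetical field names:

DEFAULTS = {'praise_num': 0, 'comment_num': 0, 'content': ''}

def fill_defaults(item):
    """Fill in a default for every column the parser failed to extract."""
    for field, default in DEFAULTS.items():
        item.setdefault(field, default)
    return item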

 

 

  settings
  item
  pipeline
  spider

Distributed crawling with scrapy-redis

Advantages: use the bandwidth of several machines to speed up crawling, and use several machines' IPs (a single machine has to throttle itself to avoid an IP ban).

Drawback: harder to code than a single-machine crawler.

Problems a distributed crawler has to solve:

centralized management of the request queue

centralized de-duplication

Installing Redis on Windows

https://github.com/MicrosoftArchive/redis/releases

Create the project

scrapy startproject ScrapyRedisTest

scrapy-redis:  https://github.com/rmax/scrapy-redis  
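
To wire scrapy-redis into the project, its README documents a handful of settings; a minimal example (host/port values are illustrative):

SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True          # keep the request queue and dupefilter in Redis between runs
REDIS_HOST = 'localhost'
REDIS_PORT = 6379
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,   # optional: push scraped items into Redis
}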

Reading the scrapy-redis source

defaults.py
import redis


# For standalone use.
DUPEFILTER_KEY = 'dupefilter:%(timestamp)s'

PIPELINE_KEY = '%(spider)s:items'

REDIS_CLS = redis.StrictRedis
REDIS_ENCODING = 'utf-8'
# Sane connection defaults.
REDIS_PARAMS = {
    'socket_timeout': 30,
    'socket_connect_timeout': 30,
    'retry_on_timeout': True,
    'encoding': REDIS_ENCODING,
}

SCHEDULER_QUEUE_KEY = '%(spider)s:requests'
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'
SCHEDULER_DUPEFILTER_KEY = '%(spider)s:dupefilter'
SCHEDULER_DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'

START_URLS_KEY = '%(name)s:start_urls'
START_URLS_AS_SET = False

connection.py

import six

from scrapy.utils.misc import load_object

from . import defaults


# Shortcut maps 'setting name' -> 'parmater name'.
SETTINGS_PARAMS_MAP = {
    'REDIS_URL': 'url',
    'REDIS_HOST': 'host',
    'REDIS_PORT': 'port',
    'REDIS_ENCODING': 'encoding',
}


def get_redis_from_settings(settings):
    """Returns a redis client instance from given Scrapy settings object.

    This function uses ``get_client`` to instantiate the client and uses
    ``defaults.REDIS_PARAMS`` global as defaults values for the parameters. You
    can override them using the ``REDIS_PARAMS`` setting.

    Parameters
    ----------
    settings : Settings
        A scrapy settings object. See the supported settings below.

    Returns
    -------
    server
        Redis client instance.

    Other Parameters
    ----------------
    REDIS_URL : str, optional
        Server connection URL.
    REDIS_HOST : str, optional
        Server host.
    REDIS_PORT : str, optional
        Server port.
    REDIS_ENCODING : str, optional
        Data encoding.
    REDIS_PARAMS : dict, optional
        Additional client parameters.

    """
    # Merge defaults.REDIS_PARAMS with the REDIS_PARAMS from the settings file into params
    params = defaults.REDIS_PARAMS.copy()
    params.update(settings.getdict('REDIS_PARAMS'))
    # XXX: Deprecate REDIS_* settings.
    for source, dest in SETTINGS_PARAMS_MAP.items():
        val = settings.get(source)
        if val:
            params[dest] = val

    # Allow ``redis_cls`` to be a path to a class.
    if isinstance(params.get('redis_cls'), six.string_types):
        params['redis_cls'] = load_object(params['redis_cls'])

    return get_redis(**params)  # delegate to get_redis


# from_settings is an alias of get_redis_from_settings, which suggests this module is meant to be imported by other modules (the alias is not used here)
# Backwards compatible alias.
from_settings = get_redis_from_settings


# Connect to Redis
def get_redis(**kwargs):
    """Returns a redis client instance.

    Parameters
    ----------
    redis_cls : class, optional
        Defaults to ``redis.StrictRedis``.
    url : str, optional
        If given, ``redis_cls.from_url`` is used to instantiate the class.
    **kwargs
        Extra parameters to be passed to the ``redis_cls`` class.

    Returns
    -------
    server
        Redis client instance.

    """
    redis_cls = kwargs.pop('redis_cls', defaults.REDIS_CLS)
    url = kwargs.pop('url', None)
    if url:
        return redis_cls.from_url(url, **kwargs)
    else:
        return redis_cls(**kwargs)
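
For reference, get_redis can also be called directly; a quick sketch assuming a Redis server on localhost (the key follows the START_URLS_KEY pattern from defaults.py):

from scrapy_redis.connection import get_redis

server = get_redis(host='localhost', port=6379)              # returns a redis.StrictRedis instance by default
server.lpush('zhihu:start_urls', 'https://www.zhihu.com/')   # seed a start URL for a spider named "zhihu"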
