Java Web Crawling with WebCollector + selenium + phantomjs (Part 2)


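While building the small example in the previous post, I found that the price shown on the page could not be retrieved. After some searching, it turned out that WebCollector needs to be combined with selenium and phantomjs to obtain content that is generated dynamically by JavaScript. This post works through an example to learn how.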

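The materials were all prepared in the previous post. I tested on Windows, so for the phantomjs runtime it is enough to download the phantomjs-windows package and unzip it into a folder. You can add the unzipped path to the PATH environment variable; if you do not, you must give the full path to the binary when launching it. The selenium and phantomjs jars are already declared in the pom file from the previous post.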
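The two ways of locating the binary described above (an explicit path, or the PATH environment variable) can be combined in a single lookup. The following is a minimal sketch and not part of the original project: the class name PhantomJsLocator and its methods are hypothetical, and only the phantomjs.binary.path property name is taken from the code in this post.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

/** Hypothetical helper (not part of WebCollector or selenium). */
public class PhantomJsLocator {

    /**
     * Searches each directory of a PATH-style string for the given file name.
     *
     * @param pathValue directories joined by File.pathSeparator; may be null
     * @param exeName   file name to look for, e.g. "phantomjs.exe"
     */
    public static Optional<Path> findOnPath(String pathValue, String exeName) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            try {
                Path candidate = Paths.get(dir, exeName);
                if (Files.isRegularFile(candidate)) {
                    return Optional.of(candidate);
                }
            } catch (InvalidPathException e) {
                // Skip PATH entries that are not valid paths on this platform.
            }
        }
        return Optional.empty();
    }

    /** Prefers an explicitly configured path, then falls back to PATH. */
    public static Optional<Path> locate(String exeName) {
        String configured = System.getProperty("phantomjs.binary.path");
        if (configured != null && Files.isRegularFile(Paths.get(configured))) {
            return Optional.of(Paths.get(configured));
        }
        return findOnPath(System.getenv("PATH"), exeName);
    }

    public static void main(String[] args) {
        Optional<Path> binary = locate("phantomjs.exe");
        System.out.println(binary.isPresent()
                ? "Found phantomjs at " + binary.get()
                : "phantomjs not found; set phantomjs.binary.path explicitly");
    }
}
```

Checking up front makes a missing binary show up as a clear message rather than as a driver startup failure later on.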

The code follows:

PageUtils.java

/*
* Copyright (C) 2015 zhao
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
* as published by the Free Software Foundation; either version 2
* of the License, or (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
*/
package com.zhao.crawler.util;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriver;

import com.gargoylesoftware.htmlunit.BrowserVersion;

import cn.edu.hfut.dmic.webcollector.model.Page;

/**
 * Helper methods for obtaining a driver that has loaded a page.
 *
 * @author <a href="mailto:ls.zhaoxiangyu@gmail.com">zhao</a>
 * @date 2015-10-22
 */
public class PageUtils {

    /**
     * Gets the HtmlUnitDriver instance bundled with WebCollector
     * (emulating the default browser).
     *
     * @param page the page whose URL should be loaded
     * @return a driver that has loaded the page
     */
    public static HtmlUnitDriver getDriver(Page page) {
        HtmlUnitDriver driver = new HtmlUnitDriver();
        driver.setJavascriptEnabled(true);
        driver.get(page.getUrl());
        return driver;
    }

    /**
     * Gets the HtmlUnitDriver instance bundled with WebCollector.
     *
     * @param page the page whose URL should be loaded
     * @param browserVersion the browser to emulate
     * @return a driver that has loaded the page
     */
    public static HtmlUnitDriver getDriver(Page page,
            BrowserVersion browserVersion) {
        HtmlUnitDriver driver = new HtmlUnitDriver(browserVersion);
        driver.setJavascriptEnabled(true);
        driver.get(page.getUrl());
        return driver;
    }

    /**
     * Gets a PhantomJSDriver (can crawl HTML generated dynamically by js).
     *
     * @param page the page whose URL should be loaded
     * @return a driver that has loaded the page
     */
    public static WebDriver getWebDriver(Page page) {
        // WebDriver driver = new HtmlUnitDriver(true);

        // System.setProperty("webdriver.chrome.driver", "D:\\Installs\\Develop\\crawling\\chromedriver.exe");
        // WebDriver driver = new ChromeDriver();

        System.setProperty("phantomjs.binary.path", "D:/Program Files/phantomjs-2.0.0-windows/bin/phantomjs.exe");
        WebDriver driver = new PhantomJSDriver();
        driver.get(page.getUrl());

        // JavascriptExecutor js = (JavascriptExecutor) driver;
        // js.executeScript("function(){}");
        return driver;
    }

    /**
     * Calls the native phantomjs binary directly (that is, without selenium).
     *
     * @param page the page whose URL is passed to the script
     * @return the script's standard output, or null if the process could not be run
     */
    public static String getPhantomJSDriver(Page page) {
        // Pass the command as an array: the executable path contains a space,
        // and the original string concatenation also lacked a space between
        // the executable and the script path.
        String[] command = {
            "D:/Program Files/phantomjs-2.0.0-windows/bin/phantomjs.exe",
            "D:/MyEclipseWorkSpace/WebCollectorDemo/src/main/resources/parser.js",
            page.getUrl().trim()
        };
        try {
            Process process = Runtime.getRuntime().exec(command);
            InputStream in = process.getInputStream();
            try (BufferedReader br = new BufferedReader(
                    new InputStreamReader(in, "UTF-8"))) {
                StringBuilder sb = new StringBuilder();
                String line;
                while ((line = br.readLine()) != null) {
                    sb.append(line);
                }
                return sb.toString();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }

        return null;
    }
}
The code above provides the methods that return a driver for fetching an HTML page (the framework's bundled HtmlUnitDriver and the third-party phantomjs driver).
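Calling the phantomjs binary directly, as getPhantomJSDriver() does, comes down to building an argument list and capturing the child process's output. Below is a minimal sketch using ProcessBuilder, which keeps each argument separate so that a path containing spaces (such as D:/Program Files/...) is passed intact. The class name ProcessOutput and both of its method names are hypothetical; the demo in main runs `java -version` only because that binary is known to exist wherever the sketch itself runs.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for running an external program and capturing its output. */
public class ProcessOutput {

    /** Builds the argument list: executable, script, then the trimmed target URL. */
    public static List<String> buildCommand(String exe, String script, String url) {
        return Arrays.asList(exe, script, url.trim());
    }

    /** Runs the command and returns stdout and stderr merged into one string. */
    public static String run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Merge stderr into stdout so the child cannot block on an unread error pipe.
        builder.redirectErrorStream(true);
        Process process = builder.start();
        StringBuilder output = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append('\n');
            }
        }
        // The output has been fully drained, so this returns once the child exits.
        process.waitFor();
        return output.toString();
    }

    public static void main(String[] args) throws Exception {
        String java = Paths.get(System.getProperty("java.home"), "bin", "java").toString();
        System.out.println(run(Arrays.asList(java, "-version")));
    }
}
```

For the phantomjs case, the list would come from buildCommand(...) with the binary path, the parser.js path, and page.getUrl().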

DemoJSCrawler.java

/*
* Copyright (C) 2015 zhao
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
* as published by the Free Software Foundation; either version 2
* of the License, or (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
*/
package com.zhao.crawler.demo;

import java.util.List;

import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.zhao.crawler.util.PageUtils;

import cn.edu.hfut.dmic.webcollector.crawler.DeepCrawler;
import cn.edu.hfut.dmic.webcollector.model.Links;
import cn.edu.hfut.dmic.webcollector.model.Page;

/**
 * Demo crawler that prints the prices found on a product list page.
 *
 * @author <a href="mailto:ls.zhaoxiangyu@gmail.com">zhao</a>
 * @date 2015-10-22
 */
public class DemoJSCrawler extends DeepCrawler {

    public DemoJSCrawler(String crawlPath) {
        super(crawlPath);
    }

    @Override
    public Links visitAndGetNextLinks(Page page) {
        // HtmlUnitDriver
        // handleByHtmlUnitDriver(page);
        // PhantomJsDriver
        handleByPhantomJsDriver(page);
        return null;
    }

    /**
     * Test of the HTML driver bundled with WebCollector.
     *
     * @param page the page to load
     */
    protected void handleByHtmlUnitDriver(Page page) {
        /* HtmlUnitDriver can extract data generated by JS */
        HtmlUnitDriver driver = PageUtils.getDriver(page, BrowserVersion.CHROME);
        /* HtmlUnitDriver can also extract data with CSS selectors, like Jsoup.
           For HtmlUnitDriver documentation, see the selenium documentation. */
        try {
            print(driver);
        } finally {
            driver.quit();
        }
    }

    /**
     * Test of the phantomjs driver.
     *
     * @param page the page to load
     */
    protected void handleByPhantomJsDriver(Page page) {
        WebDriver driver = PageUtils.getWebDriver(page);
        try {
            print(driver);
        } finally {
            // Always shut down the phantomjs process, even if print() throws.
            driver.quit();
        }
    }

    /** Prints each price in the format "element containing the price:price". */
    protected void print(WebDriver driver) {
        List<WebElement> divInfos = driver.findElements(By.cssSelector("li.gl-item"));
        for (WebElement divInfo : divInfos) {
            WebElement price = divInfo.findElement(By.className("J_price"));
            System.out.println(price + ":" + price.getText());
        }
    }

    public static void main(String[] args) throws Exception {
        DemoJSCrawler crawler = new DemoJSCrawler("D:/test/crawler/jd/");
        crawler.addSeed("http://list.jd.com/list.html?cat=1319,1523,7052&page=1&go=0&JL=6_0_0");
        crawler.start(1);
    }

}

This class extends DeepCrawler and has three main methods:

print(): prints the price information that was retrieved, in the format: element containing the price:price.

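I want to capture the prices on the product list page. Inspecting the page elements in the browser (the screenshot is not preserved) shows the following structure: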
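Each product sits in a list item, an li element with class gl-item; inside it, the price is in a strong tag with class J_price. Following this structure, we first get the list items, then locate the price tag within each one, and print it in the format: price tag:price.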

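The call price.getText() in that method returns the text content of the tag and of its child tags (the markup is stripped automatically and only the text is returned). For details, see selenium-api-2.12.chm.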

handleByHtmlUnitDriver(): fetches the page with the bundled driver.

Running the program printed the following (the output screenshot is not preserved):

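Only the located tags were printed, not the prices. Looking at the page's js files shows that JD loads its prices onto the page dynamically with js. This demonstrates that the HTML driver bundled with WebCollector does not cope well with HTML generated dynamically by js, so the other two tools, selenium and phantomjs, are needed to crawl such pages.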

handleByPhantomJsDriver(): fetches the HTML page through PhantomJSDriver. Running it gave the following result (the output screenshot is not preserved):


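The log shows that the prices were retrieved successfully. The next post will build a small example that crawls JD product list information and will provide the source code for download. The previous post covers the resource downloads and environment setup.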






