How do I scrape something after JS has changed the DOM?


I'm using Mechanize, although I'm open to Nokogiri if Mechanize can't do it.

I'd like to scrape the page after all the scripts have loaded as opposed to beforehand.

How might I do this?

4 Answers

#1 (4 votes)

Nokogiri and Mechanize are not full web browsers and do not run JavaScript in a browser-model DOM. You want something like Watir or Selenium, which lets you use Ruby to control an actual web browser.

#2 (6 votes)

I think a good option is something like this with Nokogiri, Watir, and PhantomJS:

require 'watir'
require 'nokogiri'

b = Watir::Browser.new(:phantomjs)
b.goto URL  # URL holds the address of the page to scrape
doc = Nokogiri::HTML(b.html)

The resulting doc reflects the page after the scripts have run. PhantomJS is convenient because it is headless, so no browser window needs to open.
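Scripts that fetch data asynchronously may still be changing the DOM right after goto returns. One way to guard against that, sketched here, is to poll the HTML until two consecutive snapshots match. The helper name settled_html, its interval and timeout parameters, and the example URL are assumptions for illustration and are not part of Watir. The helper only assumes the browser object responds to goto and html, as Watir::Browser does.

```ruby
# Hypothetical helper (not part of Watir): loads a page and returns its HTML
# once two consecutive snapshots are identical.
# Assumption: `browser` responds to #goto and #html, like Watir::Browser.
def settled_html(browser, url, interval: 0.5, timeout: 10)
  browser.goto(url)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  previous = browser.html
  loop do
    sleep interval
    current = browser.html
    # Unchanged since the last look: treat the page as settled.
    return current if current == previous
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) > deadline
      raise "page did not settle within #{timeout}s"
    end
    previous = current
  end
end

# Usage (needs the watir and nokogiri gems, so it is left commented out):
#   require 'watir'
#   require 'nokogiri'
#   b = Watir::Browser.new(:phantomjs)
#   doc = Nokogiri::HTML(settled_html(b, "http://example.com/"))
```

Comparing snapshots is a blunt heuristic. If you know which element the script creates, waiting for that element to appear is more precise.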

#3 (2 votes)

In addition to watir-webdriver and capybara-webkit, Celerity is a good option, although it is JRuby only.

#4 (0 votes)

I don't know anything about Mechanize or Nokogiri, so I can't comment specifically on them. However, getting the HTML after JavaScript has modified it is a problem I believe can only be solved with more JavaScript. To get the newly generated HTML you would need to read the .innerHTML of the document element. This can be tricky, since you would have to inject JS into the page.

The only way I know of to accomplish this is to write a Firefox plugin. With a plugin you can run JavaScript on a page even though it's not your page. Sorry I can't be more help; I hope this puts you on the right path.

If you're interested in plug-ins, this is one place to start: http://anthonystechblog.wordpress.com/category/internet/firefox/
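For what it's worth, a browser driven from Ruby can also run the injected JavaScript for you. A minimal sketch, assuming a Watir browser object (anything with an execute_script method that returns the script's result would do); the helper name live_dom_html and the example URL are hypothetical:

```ruby
# Hypothetical helper: asks the page itself for its current markup, i.e. the
# DOM as it stands after scripts have modified it.
# Assumption: `browser` responds to #execute_script, like Watir::Browser.
def live_dom_html(browser)
  browser.execute_script("return document.documentElement.innerHTML;")
end

# Usage (needs the watir gem and a browser driver, so it is left commented out):
#   require 'watir'
#   b = Watir::Browser.new(:phantomjs)
#   b.goto "http://example.com/"
#   puts live_dom_html(b)
```

Note that innerHTML leaves out the html element itself; reading outerHTML instead would include it.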
