How do I scrape something after JS has changed the DOM?


I'm using Mechanize, although I'm open to Nokogiri if Mechanize can't do it.


I'd like to scrape the page after all the scripts have loaded as opposed to beforehand.


How might I do this?


4 Answers

#1


4  

Nokogiri and Mechanize are not full web browsers and do not run JavaScript in a browser-model DOM. You want to use something like Watir or Selenium which allow you to use Ruby to control an actual web browser.


#2


6  

I think a good option is to combine Nokogiri, Watir, and PhantomJS, something like this:

require 'watir'
require 'nokogiri'
b = Watir::Browser.new(:phantomjs)
b.goto URL # URL is the address of the page to scrape
doc = Nokogiri::HTML(b.html) # parse the DOM as the browser rendered it
b.close

The resulting doc reflects the page after the scripts have loaded. PhantomJS is nice because it is headless, so there is no need to open a browser window.
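One caveat: if the page builds its content asynchronously, the scripts may not have finished by the time b.html is read. Here is a minimal sketch of one way to wait, assuming you know some text (a class name, an id) that only shows up once the scripts are done. The helper name html_after_scripts and its marker, timeout and interval parameters are made up for illustration and are not part of Watir; the only thing assumed about the browser object is that it responds to goto and html, as Watir::Browser does.

```ruby
# Hypothetical helper (not part of Watir): load a page, then poll the
# rendered HTML until it contains `marker` or `timeout` seconds pass.
# `browser` can be any object that responds to #goto and #html.
def html_after_scripts(browser, url, marker:, timeout: 10, interval: 0.2)
  browser.goto(url)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout

  loop do
    html = browser.html
    # The marker is text we expect only after the page's scripts have run.
    return html if html.include?(marker)

    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise "timed out after #{timeout}s waiting for #{marker.inspect}"
    end

    sleep interval
  end
end

# Possible usage with the setup from this answer (needs the watir and
# nokogiri gems plus a browser driver, so it is left commented out):
#
#   b   = Watir::Browser.new(:phantomjs)
#   doc = Nokogiri::HTML(html_after_scripts(b, URL, marker: 'results'))
#   b.close
```

Polling the HTML string is crude but keeps the sketch independent of any particular driver; the 'results' marker in the usage comment is a placeholder for whatever your target page actually renders.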

#3


2  

In addition to watir-webdriver and capybara-webkit, Celerity is a good option, although it is JRuby only.

#4


0  

I don't know anything about Mechanize or Nokogiri, so I can't comment specifically on them. However, I believe the problem of getting the HTML after JavaScript has modified it can only be solved with more JavaScript. To get the newly generated HTML you would need to read the .innerHTML of the document element. This can be tricky, since you would have to inject JS into the page.

The only way I know of to accomplish this is to write a Firefox plugin. With a plugin you can run JavaScript on a page even though it's not your page. Sorry I can't be more help; I hope this puts you on the right path.

If you're interested in plug-ins, this is one place to start: http://anthonystechblog.wordpress.com/category/internet/firefox/
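For what it's worth, the .innerHTML idea can probably be tried without a plugin if you are already driving a browser from Ruby, as the other answers suggest. A rough sketch, assuming a driver object that responds to execute_script (Selenium::WebDriver drivers and Watir::Browser both do). The constant DOM_SNAPSHOT_JS and the method live_dom_html are hypothetical names used only for this example.

```ruby
# JavaScript evaluated inside the page. The leading `return` is needed
# because WebDriver runs the snippet as the body of a function.
DOM_SNAPSHOT_JS = 'return document.documentElement.innerHTML'

# Hypothetical helper: ask the browser for its current DOM as a string.
# `driver` is assumed to be any object that responds to #execute_script.
def live_dom_html(driver)
  driver.execute_script(DOM_SNAPSHOT_JS)
end

# Possible usage (needs the watir gem and a browser driver, so it is
# left commented out):
#
#   browser = Watir::Browser.new(:phantomjs)
#   browser.goto(URL)
#   puts live_dom_html(browser)
#   browser.close
```

This returns the markup inside the html element as it stands at the moment of the call, so it reflects whatever the page's scripts have changed up to that point.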
