What are the pros and cons of various ways of analyzing websites?


I'd like to write some code which looks at a website and its assets and creates some stats and a report. Assets would include images. I'd like to be able to trace links, or at least try to identify menus on the page. I'd also like to take a guess at what CMS created the site, based on class names and such.

I'm going to assume that the site is reasonably static, or is driven by a CMS, but is not something like an RIA.

Some ideas about how I might proceed:

1) Load the site into an iframe. This would be nice because I could parse it with jQuery. Or could I? Seems like I'd be hampered by cross-site scripting rules. I've seen suggestions to get around those problems, but I'm assuming browsers will continue to clamp down on such things. Would a bookmarklet help?

2) A Firefox add-on. This would let me get around the cross-site scripting problems, right? Seems doable, because debugging tools for Firefox (and GreaseMonkey, for that matter) let you do all kinds of things.

3) Grab the site on the server side. Use libraries on the server to parse.

4) YQL. Isn't this pretty much built for parsing sites?

7 Solutions

#1


That really depends on the scale of your project. If it’s just casual, not fully automated, I’d strongly suggest a Firefox Addon.

I'm right in the middle of a similar project. It has to analyze the DOM of a page generated using Javascript. Writing a server-side browser was too difficult, so we turned to some other technologies: Adobe AIR, Firefox Addons, userscripts, etc.

Fx addon is great, if you don't need the automation. A script can analyze the page, show you the results, ask you to correct the parts it is uncertain of, and finally post the data to some backend. You have access to all of the DOM, so you don't need to write a JS/CSS/HTML/whatever parser (that would be a hell of a job!)

Another way is Adobe AIR. Here, you have more control over the application: you can launch it in the background, doing all the parsing and analyzing without your interaction. The downside is that you don't have access to the full DOM of the pages. The only way to get past this is to set up a simple proxy that fetches the target URL and adds some Javascript (to create a trusted-untrusted sandbox bridge)... It's a dirty hack, but it works.

Edit: In Adobe AIR, there are two ways to access a foreign website’s DOM:

  • Load it via Ajax, create HTMLLoader object, and feed the response into it (loadString method IIRC)

  • Create an iframe, and load the site in an untrusted sandbox.

I don't remember why, but the first method failed for me, so I had to use the other one (I think there were some security reasons involved that I couldn't work around). And I had to create a sandbox to access the site's DOM. Here's a bit about dealing with sandbox bridges. The idea is to create a proxy that adds a simple JS snippet, which creates childSandboxBridge and exposes some methods to the parent (in this case: the AIR application). The script contents are something like:

// This runs inside the sandboxed (untrusted) page; the parent AIR
// application reaches the page's data only through this object.
window.childSandboxBridge = {
   // ... some methods returning data
};

(be careful: there are limitations on what can be passed via the sandbox bridge. No complex objects for sure! Use only the primitive types)

So, the proxy basically tampered with all the requests that returned HTML or XHTML. Everything else was just passed through unchanged. I did this using Apache + PHP, but it could certainly be done with a real proxy plus some plugins/custom modules. This way I had access to the DOM of any site.
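
Purely to illustrate the idea in runnable form, here is a comparable toy in Python rather than the Apache + PHP setup described above: it fetches the target URL, injects a small script into HTML responses, and passes everything else through unchanged. The injected snippet and the URL-in-path convention are invented for this sketch, not production code:

import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Simplified stand-in for the real bridge-creating script.
INJECTED = b"<script>window.childSandboxBridge = {};</script>"

class InjectingProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        # The target URL is passed in the path, e.g. /http://example.com/
        with urllib.request.urlopen(self.path.lstrip("/")) as resp:
            body = resp.read()
            ctype = resp.headers.get("Content-Type", "")
        if "html" in ctype:
            # Tamper only with (X)HTML responses, as described above.
            body = body.replace(b"</head>", INJECTED + b"</head>", 1)
        self.send_response(200)
        self.send_header("Content-Type", ctype or "application/octet-stream")
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("localhost", 8080), InjectingProxy).serve_forever()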

end of edit.

The third way I know of, the hardest way: set up an environment similar to the one behind Browsershots. Then you're using Firefox with automation. If you have Mac OS X on a server, you could play with AppleScript to do the automation for you.

So, to sum up:

  • PHP/server-side script: you have to implement your own browser, JS engine, CSS parser, etc., etc. Fully under your control, and fully automated.

  • Firefox Addon — has access to DOM and all stuff. Requires user to operate it (or at least an open firefox session with some kind of autoreload). Nice interface for a user to guide the whole process.

  • Adobe AIR — requires a working desktop computer, more difficult than creating a Fx addon, but more powerful.

  • Automated browser: more of a desktop-programming issue than web development. Can be set up on a Linux terminal without a graphical environment. Requires master hacking skills. :)

#2


My suggestion would be:

a) Choose a scripting language. I suggest Perl or Python; curl+bash would also work, but it has no exception handling.

b) Load the home page via a script, using a Python or Perl library. Try Perl's WWW::Mechanize module.

Python has plenty of built-in modules; also take a look at www.feedparser.org

c) Inspect the server header (via the HTTP HEAD command) to find the application server name. If you are lucky you will also find the CMS name (e.g. WordPress, etc.).
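
In Python, for example, the HEAD request is a few lines of standard library. Which headers a server actually exposes varies a lot; X-Powered-By and X-Generator are just common examples, and example.com is a placeholder:

import http.client

conn = http.client.HTTPConnection("example.com")
conn.request("HEAD", "/")
resp = conn.getresponse()
# "Server" usually names the web/application server; headers like
# "X-Powered-By" or "X-Generator", when present, can hint at the CMS.
for name in ("Server", "X-Powered-By", "X-Generator"):
    print(name, "=", resp.getheader(name))
conn.close()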

d) Use the Google XML API to ask for something like "link:sitedomain.com" to find out which links point to the site; you will find Python code examples on Google's pages. Asking Google for the domain's ranking can also be helpful.

e) You can collect the data in a SQLite db, then post-process it in Excel.
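
A minimal sketch of step e) with Python's built-in sqlite3 module. The table layout and the numbers are invented for the example; dumping CSV is one easy route into Excel:

import sqlite3

conn = sqlite3.connect("site_stats.db")
conn.execute("""CREATE TABLE IF NOT EXISTS pages
                (url TEXT PRIMARY KEY, images INTEGER, links INTEGER)""")
# Hypothetical numbers; in practice these come from your crawler.
conn.execute("INSERT OR REPLACE INTO pages VALUES (?, ?, ?)",
             ("http://example.com/", 12, 48))
conn.commit()
# Dump as CSV lines for post-processing in Excel.
for row in conn.execute("SELECT url, images, links FROM pages"):
    print(",".join(map(str, row)))
conn.close()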

#3


You should simply fetch the source (XHTML/HTML) and parse it. You can do that in almost any modern programming language, from your own computer connected to the Internet.

iframe is a widget for displaying HTML content, it's not a technology for data analysis. You can analyse data without displaying it anywhere. You don't even need a browser.

Tools in languages like Python, Java, and PHP are certainly more powerful for your tasks than Javascript or whatever you have in those Firefox extensions.

It also does not matter what technology is behind the website. XHTML/HTML is just a string of characters no matter how a browser renders it. To find your "assets" you will simply look for specific HTML tags like "img", "object", etc.
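
A sketch of that in Python, using only the standard library (a real tool would more likely use BeautifulSoup or lxml, but this shows how little machinery the job needs; the URL is a placeholder):

import urllib.request
from html.parser import HTMLParser

class AssetScanner(HTMLParser):
    """Collects asset and link references; class attributes could be
    gathered the same way to guess at the CMS."""
    def __init__(self):
        super().__init__()
        self.assets, self.links = [], []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and "src" in attrs:
            self.assets.append(attrs["src"])
        elif tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])

with urllib.request.urlopen("http://example.com/") as resp:
    html = resp.read().decode("utf-8", errors="replace")

scanner = AssetScanner()
scanner.feed(html)
print(len(scanner.assets), "images,", len(scanner.links), "links")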

#4


I think writing an extension to Firebug would probably be one of the easiest ways to do this. For instance, YSlow has been developed on top of Firebug and it provides some of the features you're looking for (e.g. image, CSS and Javascript summaries).

#5


I suggest you try option #4 first (YQL), the reason being that it looks like it might get you all the data you need. You could then build your tool as a website or the like, where you could get info about a site without actually having to go to the page in your browser. If YQL works for what you need, then it looks like you'd have the most flexibility with this option.
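
For what it's worth, a YQL call was just an HTTP GET against Yahoo's public endpoint, treating web pages as tables. A historical sketch in Python, with the query and endpoint written from memory (the YQL service has since been retired, so the actual request is left commented out):

import urllib.parse, urllib.request

# YQL query: select all anchor tags from a page's HTML.
query = 'select * from html where url="http://example.com" and xpath="//a"'
url = ("https://query.yahooapis.com/v1/public/yql?"
       + urllib.parse.urlencode({"q": query, "format": "json"}))

# The endpoint is gone; shown only to illustrate the shape of the call.
# with urllib.request.urlopen(url) as resp:
#     print(resp.read())
print(url)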

If YQL doesn't pan out, then I suggest you go with option #2 (a Firefox addon).

I think you should probably try and stay away from option #1 (the iframe) because of the cross-site scripting issues you are already aware of.

Also, I have used option #3 (grab the site on the server side), and one problem I've run into in the past is the site being grabbed loading content after the fact using AJAX calls. At the time I didn't find a good way to grab the full content of pages that use AJAX, SO BE WARY OF THAT OBSTACLE! Other people here have run into that also; see this: Scrape a dynamic website

THE AJAX DYNAMIC CONTENT ISSUE: There may be some solutions to the AJAX issue, such as using AJAX itself to grab the content with the evalScripts:true parameter. See the following articles for more info, including an issue you might need to be aware of in how evaluated Javascript from the grabbed content works:

Prototype library: http://www.prototypejs.org/api/ajax/updater

Message Board: http://www.crackajax.net/forums/index.php?action=vthread&forum=3&topic=17

Or if you are willing to spend money, take a look at this: http://aptana.com/jaxer/guide/develop_sandbox.html

Here is an ugly (but maybe useful) example of using a .NET component called WebRobot to scrape content from a dynamic AJAX-enabled site such as Digg.com. http://www.vbdotnetheaven.com/UploadFile/fsjr/ajaxwebscraping09072006000229AM/ajaxwebscraping.aspx

Also, here is a general article on using PHP and the cURL library to scrape all the links from a web page. However, I'm not sure if this article and the cURL library cover the AJAX content issue: http://www.merchantos.com/makebeta/php/scraping-links-with-php/

One thing I just thought of that might work is:

  1. grab the content and evaluate it using AJAX.

  2. send the content to your server.

  3. evaluate the page, links, etc.

  4. [OPTIONAL] save the content as a local page on your server.

  5. return the statistics info back to the page.

  6. [OPTIONAL] display the cached local version with highlighting.

^Note: If saving a local version, you will want to use regular expressions to convert relative link paths (especially for images) into correct absolute ones.
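
If the post-processing happens somewhere like Python, urllib.parse.urljoin already does that path math, which is less fragile than a regular expression (the base URL and paths here are made up for illustration):

from urllib.parse import urljoin

base = "http://example.com/blog/post.html"  # page the content came from
for src in ("../css/style.css", "/images/logo.png", "pic.jpg"):
    # Resolve each relative reference against the page URL.
    print(src, "->", urljoin(base, src))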

Good luck! Just please be aware of the AJAX issue. Many sites nowadays load content dynamically using AJAX: Digg.com does, MSN.com does for its news feeds, etc.

#6


Being primarily a .NET programmer these days, my advice would be to use C# or some other language with .NET bindings. Use the WebBrowser control to load the page, and then iterate through the elements in the document (via GetElementsByTagName()) to get links, images, etc. With a little extra work (parsing the BASE tag, if available), you can resolve src and href attributes into URLs and use HttpWebRequest to send HEAD requests for the target images to determine their sizes. That should give you an idea of how graphically intensive the page is, if that's something you're interested in. Additional items you might want in your stats could include backlinks/PageRank (via the Google API), whether the page validates as HTML or XHTML, what percentage of links point to URLs in the same domain versus off-site, and, if possible, Google rankings of the page for various search strings (dunno if that's programmatically available, though).
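
The answer is .NET-specific, but the HEAD-request trick itself is language-agnostic. A hedged sketch of the image-sizing part in Python (Content-Length is not guaranteed to be present, and the image list here is hypothetical):

import urllib.request

def head_size(url):
    """Return the Content-Length reported for url, or None."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        length = resp.headers.get("Content-Length")
    return int(length) if length else None

images = ["http://example.com/images/logo.png"]  # gathered from img tags
total = sum(head_size(u) or 0 for u in images)
print("approximate image payload:", total, "bytes")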

#7


I would use a script (or a compiled app depending on language of choice) written in a language that has strong support for networking and text parsing/regular expressions.

  • Perl
  • Python
  • .NET language of choice
  • Java

whatever language you are most comfortable with. A basic standalone script/app keeps you from needing to worry too much about browser integration and security issues.

