A Brief Look at the Nutch Crawl Process


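Nutch data files:
crawldb: the crawl database, which stores the URLs to be crawled.
linkdb: the link database, which stores the links for each URL, including the source address and the linked address.
segments: URLs are fetched in batches, and each segment holds one batch that is fetched as a unit.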

 

 

crawldb

 

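crawldb stores URLs. On the first run the seed URL (http://blog.tianya.cn) is injected, and after the first fetch the crawldb is updated with the URLs that were discovered. In the next round (depth=2) a new set of URLs is read from the crawldb and a new round of fetching begins.

crawldb contains two directories, current and old. current holds the current URL set and old is a backup of the previous one; each time a new version is generated, the existing current is renamed to old.
current and old have the same structure: each contains a directory such as part-00000 (only one in local mode), and inside part-00000 there are two files, data and index. One holds the data and the other holds the index.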

 

Nutch also provides a command for inspecting the state of the crawldb directory (readdb):

hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ ./bin/nutch readdb 
Usage: CrawlDbReader <crawldb> (-stats | -dump <out_dir> | -topN <nnnn> <out_dir> [<min>] | -url <url>) 
    <crawldb>    directory name where crawldb is located 
    -stats [-sort]     print overall statistics to System.out 
        [-sort]    list status sorted by host 
    -dump <out_dir> [-format normal|csv|crawldb]    dump the whole db to a text file in <out_dir> 
        [-format csv]    dump in Csv format 
        [-format normal]    dump in standard format (default option) 
        [-format crawldb]    dump as CrawlDB 
        [-regex <expr>]    filter records with expression 
        [-status <status>]    filter records by CrawlDatum status 
    -url <url>    print information on <url> to System.out 
    -topN <nnnn> <out_dir> [<min>]    dump top <nnnn> urls sorted by score to <out_dir> 
        [<min>]    skip records with scores below this value. 
            This can significantly improve performance. 
hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ ./bin/nutch readdb ./data/crawldb -stats 
CrawlDb statistics start: ./data/crawldb 
Statistics for CrawlDb: ./data/crawldb 
TOTAL urls:    2520 
retry 0:    2520 
min score:    0.0 
avg score:    8.8253967E-4 
max score:    1.014 
status 1 (db_unfetched):    2346 
status 2 (db_fetched):    102 
status 3 (db_gone):    1 
status 4 (db_redir_temp):    67 
status 5 (db_redir_perm):    4 
CrawlDb statistics: done

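Notes:
The -stats option is a quick and useful way to review crawl information:

TOTAL urls: the number of URLs currently in the crawldb.
db_unfetched: pages that are linked from already-fetched pages but have not been fetched themselves (for example because they were excluded by the URL filters when the fetch list was generated, or because they fell outside the topN limit).
db_gone: pages for which a 404 or another error presumed to be permanent occurred; this status prevents them from being fetched again later.
db_fetched: pages that have been fetched successfully. If this value is 0, something has certainly gone wrong.
db_redir_temp and db_redir_perm: pages that returned a temporary or a permanent redirect, respectively.

min score, avg score and max score are statistics from the scoring algorithm, which is the basis for judging page importance; they are not discussed here.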
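To track these numbers across crawl rounds, the -stats text can be turned into a small data structure. The following is a minimal sketch, assuming the output has the same line layout as the transcript above; `parse_stats` and `SAMPLE` are illustrative names, not part of Nutch.

```python
import re

# Sample text copied from the `readdb -stats` transcript above.
SAMPLE = """\
CrawlDb statistics start: ./data/crawldb
Statistics for CrawlDb: ./data/crawldb
TOTAL urls:    2520
retry 0:    2520
min score:    0.0
avg score:    8.8253967E-4
max score:    1.014
status 1 (db_unfetched):    2346
status 2 (db_fetched):    102
status 3 (db_gone):    1
status 4 (db_redir_temp):    67
status 5 (db_redir_perm):    4
CrawlDb statistics: done
"""

STATUS_LINE = re.compile(r"^status\s+\d+\s+\((\w+)\):\s+(\d+)\s*$")
TOTAL_LINE = re.compile(r"^TOTAL urls:\s+(\d+)\s*$")


def parse_stats(text):
    """Return (total_urls, {status_name: count}) from `readdb -stats` text."""
    total = None
    counts = {}
    for line in text.splitlines():
        m = TOTAL_LINE.match(line)
        if m:
            total = int(m.group(1))
            continue
        m = STATUS_LINE.match(line)
        if m:
            counts[m.group(1)] = int(m.group(2))
    return total, counts


total, counts = parse_stats(SAMPLE)
# Share of known URLs that have actually been fetched so far.
fetched_ratio = counts["db_fetched"] / total
print(total, counts, round(fetched_ratio, 4))
```

In this sample the per-status counts add up to TOTAL urls, which is a handy sanity check when comparing the crawldb before and after an update.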

 

In addition, the -dump option of readdb writes the contents of the crawldb to a file for inspection:

hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ ./bin/nutch readdb ./data/crawldb -dump crawl_tianya_out 
CrawlDb dump: starting 
CrawlDb db: ./data/crawldb 
CrawlDb dump: done 
hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ ls ./crawl_tianya_out/ 
part-00000 
hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ less ./crawl_tianya_out/part-00000

http://100w.tianya.cn/  Version: 7 
Status: 1 (db_unfetched) 
Fetch time: Sun Dec 08 21:42:34 CST 2013 
Modified time: Thu Jan 01 08:00:00 CST 1970 
Retries since fetch: 0 
Retry interval: 2592000 seconds (30 days) 
Score: 1.3559322E-5 
Signature: null 
Metadata:

http://aimin_001.blog.tianya.cn/        Version: 7 
Status: 4 (db_redir_temp) 
Fetch time: Tue Jan 07 21:38:13 CST 2014 
Modified time: Thu Jan 01 08:00:00 CST 1970 
Retries since fetch: 0 
Retry interval: 2592000 seconds (30 days) 
Score: 0.016949153 
Signature: null 
Metadata: Content-Type: text/html_pst_: temp_moved(13), lastModified=0:http://blog.tianya.cn/blogger/blog_main.asp?BlogID=134876

http://alice.tianya.cn/ Version: 7 
Status: 1 (db_unfetched) 
Fetch time: Sun Dec 08 21:42:34 CST 2013 
Modified time: Thu Jan 01 08:00:00 CST 1970 
Retries since fetch: 0 
Retry interval: 2592000 seconds (30 days) 
Score: 3.3898305E-6 
Signature: null 
Metadata:

http://anger.blog.tianya.cn/    Version: 7 
Status: 4 (db_redir_temp) 
Fetch time: Tue Jan 07 21:38:13 CST 2014 
Modified time: Thu Jan 01 08:00:00 CST 1970 
Retries since fetch: 0 
Retry interval: 2592000 seconds (30 days) 
Score: 0.016949153 
Signature: null 
Metadata: Content-Type: text/html_pst_: temp_moved(13), lastModified=0:http://blog.tianya.cn/blogger/blog_main.asp?BlogID=219280


………………

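As the output above shows, each record stores detailed crawl information for a URL: status, fetch time, modified time, retry interval, score, signature (fingerprint) and metadata.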
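The text dump is regular enough to be read back into records. The sketch below assumes the layout shown in the transcript: a `URL<whitespace>Version: N` header line, then `Key: value` lines, with records separated by blank lines. `parse_crawldb_dump` is a hypothetical helper, and the sample holds the first two records from the dump above.

```python
import re
from collections import Counter

SAMPLE_DUMP = """\
http://100w.tianya.cn/\tVersion: 7
Status: 1 (db_unfetched)
Fetch time: Sun Dec 08 21:42:34 CST 2013
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.3559322E-5
Signature: null
Metadata:

http://aimin_001.blog.tianya.cn/\tVersion: 7
Status: 4 (db_redir_temp)
Fetch time: Tue Jan 07 21:38:13 CST 2014
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.016949153
Signature: null
Metadata: Content-Type: text/html_pst_: temp_moved(13), lastModified=0:http://blog.tianya.cn/blogger/blog_main.asp?BlogID=134876
"""

HEADER = re.compile(r"^(\S+)\s+Version:\s+(\d+)\s*$")
STATUS = re.compile(r"^(\d+)\s+\((\w+)\)$")


def parse_crawldb_dump(text):
    """Parse `readdb -dump` text (assumed layout) into a list of dicts."""
    records = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        head = HEADER.match(lines[0])
        if not head:
            continue  # not a record header, e.g. an elision marker
        rec = {"url": head.group(1), "version": int(head.group(2))}
        for line in lines[1:]:
            # Split on the first colon only; values may contain colons.
            key, _, value = line.partition(":")
            rec[key.strip()] = value.strip()
        st = STATUS.match(rec.get("Status", ""))
        if st:
            rec["status_code"] = int(st.group(1))
            rec["status_name"] = st.group(2)
        if "Score" in rec:
            rec["score"] = float(rec["Score"])
        records.append(rec)
    return records


records = parse_crawldb_dump(SAMPLE_DUMP)
by_status = Counter(r.get("status_name", "unknown") for r in records)
print(by_status)
```

On a real dump, the same call can be applied to the contents of part-00000, for example to list every URL in a given status.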

 

The -url option shows the information for one specific URL:

hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ ./bin/nutch readdb ./data/crawldb -url http://zzbj.tianya.cn/ 
URL: http://zzbj.tianya.cn/ 
Version: 7 
Status: 1 (db_unfetched) 
Fetch time: Sun Dec 08 21:42:34 CST 2013 
Modified time: Thu Jan 01 08:00:00 CST 1970 
Retries since fetch: 0 
Retry interval: 2592000 seconds (30 days) 
Score: 7.6175966E-6 
Signature: null 
Metadata:

 

segments

 

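Each segment is a set of URLs that is fetched as a unit. A segment is a directory with the following subdirectories:

  • crawl_generate names the set of URLs to be fetched;
  • crawl_fetch contains the status of fetching each URL;
  • content contains the raw content retrieved from each URL;
  • parse_text contains the parsed text of each URL;
  • parse_data contains the outlinks and metadata parsed from each URL;
  • crawl_parse contains the outlink URLs, used to update the crawldb.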

 

A short digression: the end of nohup.out shows that an exception occurred:


hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ tail -n 50 nohup.out 
Parsed (1ms):http://www.tianya.cn/52491364 
Parsed (0ms):http://www.tianya.cn/55086751 
Parsed (0ms):http://www.tianya.cn/73398397 
Parsed (0ms):http://www.tianya.cn/73792451 
Parsed (0ms):http://www.tianya.cn/74299859 
Parsed (0ms):http://www.tianya.cn/76154565 
Parsed (0ms):http://www.tianya.cn/81507846 
Parsed (0ms):http://www.tianya.cn/9887577 
Parsed (0ms):http://www.tianya.cn/mobile/ 
Parsed (1ms):http://xinzhi.tianya.cn/ 
Parsed (0ms):http://yuqing.tianya.cn/ 
ParseSegment: finished at 2013-12-08 21:42:24, elapsed: 00:00:07 
CrawlDb update: starting at 2013-12-08 21:42:24 
CrawlDb update: db: data/crawldb 
CrawlDb update: segments: [data/segments/20131208213957] 
CrawlDb update: additions allowed: true 
CrawlDb update: URL normalizing: true 
CrawlDb update: URL filtering: true 
CrawlDb update: 404 purging: false 
CrawlDb update: Merging segment data into db. 
CrawlDb update: finished at 2013-12-08 21:42:37, elapsed: 00:00:13 
LinkDb: starting at 2013-12-08 21:42:37 
LinkDb: linkdb: data/linkdb 
LinkDb: URL normalize: true 
LinkDb: URL filter: true 
LinkDb: internal links will be ignored. 
LinkDb: adding segment: file:/home/hu/data/nutch/release-1.6/runtime/local/data/segments/20131208213957 
LinkDb: adding segment: file:/home/hu/data/nutch/release-1.6/runtime/local/data/segments/20131208211101 
LinkDb: adding segment: file:/home/hu/data/nutch/release-1.6/runtime/local/data/segments/20131208213723 
LinkDb: adding segment: file:/home/hu/data/nutch/release-1.6/runtime/local/data/segments/20131208213806 
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/hu/data/nutch/release-1.6/runtime/local/data/segments/20131208211101/parse_data 
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197) 
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40) 
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208) 
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:989) 
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:981) 
    at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174) 
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897) 
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850) 
    at java.security.AccessController.doPrivileged(Native Method) 
    at javax.security.auth.Subject.doAs(Subject.java:396) 
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121) 
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850) 
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824) 
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261) 
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:180) 
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:151) 
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:143) 
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) 
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55) 
hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$

 

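As the log shows, LinkDb failed because segment 20131208211101 has no parse_data directory. This is related to the http.agent.name problem described in the previous note: that run aborted with an error but still left its segment directory behind. The fix is simple: delete the directory that causes the error.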

hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ ls ./data/segments/ 
20131208211101  20131208213723  20131208213806  20131208213957 
hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ rm -rf data/segments/20131208211101 
hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ ls ./data/segments/ 
20131208213723  20131208213806  20131208213957 
hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$

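Because we ran with depth 3 and each round of a crawl produces one segment, three directories remain. Segments have a limited lifetime: once the pages have been re-fetched by the crawler, the segments from earlier fetches are obsolete. Segment directories are named after their creation time, which makes it easy to delete obsolete segments and save storage space. Let's look at what each directory contains: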

hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ ls data/segments/20131208213723/ 
content  crawl_fetch  crawl_generate  crawl_parse  parse_data  parse_text

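As shown, a segment contains the following subdirectories (mostly in binary format):

content: the content of each fetched page
crawl_fetch: the fetch status of each page
crawl_generate: the list of URLs to be fetched
crawl_parse: the outlink URLs, used to update the crawldb
parse_data: the outlinks and metadata of each page
parse_text: the parsed text of each fetched page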

 

When each directory is created

1. crawl_generate is created by the Generator;
2. content and crawl_fetch are created by the Fetcher;
3. crawl_parse, parse_data and parse_text are created by ParseSegment.
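Since each phase writes a known set of subdirectories, an incomplete segment (like the one that broke LinkDb earlier) can be found before the next step runs. This is a rough sketch under that assumption; `missing_parts` and `incomplete_segments` are hypothetical helpers, and the demo builds a throwaway directory tree rather than touching real crawl data.

```python
import os
import tempfile

# Subdirectories grouped by the phase that writes them (as listed above).
PHASE_DIRS = {
    "generate": ["crawl_generate"],
    "fetch": ["content", "crawl_fetch"],
    "parse": ["crawl_parse", "parse_data", "parse_text"],
}


def missing_parts(segment_dir):
    """Return {phase: [missing subdirectories]} for one segment directory."""
    missing = {}
    for phase, names in PHASE_DIRS.items():
        absent = [n for n in names
                  if not os.path.isdir(os.path.join(segment_dir, n))]
        if absent:
            missing[phase] = absent
    return missing


def incomplete_segments(segments_root):
    """Map segment name -> missing parts, only for segments lacking any part."""
    result = {}
    for name in sorted(os.listdir(segments_root)):
        path = os.path.join(segments_root, name)
        if os.path.isdir(path):
            gaps = missing_parts(path)
            if gaps:
                result[name] = gaps
    return result


# Demo on a temporary tree (hypothetical layout: one segment that stopped
# after the generate phase, and one complete segment).
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "20131208211101", "crawl_generate"))
    for sub in sum(PHASE_DIRS.values(), []):
        os.makedirs(os.path.join(root, "20131208213723", sub))
    report = incomplete_segments(root)

print(report)
```

Pointing `incomplete_segments` at a real segments directory only reads directory names, so it is safe to run before deciding which segment to delete.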

 

How to view the contents of each directory, for example the page source stored in content, is covered later in this post.

 

 

linkdb

 

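linkdb: the link database, which stores the links for each URL, including the source address and the linked address.

Because of the http.agent.name problem, the linkdb was not populated, so I re-ran the crawl command. Here is the end of nohup.out after it completed:

......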

ParseSegment: finished at 2014-01-11 16:30:22, elapsed: 00:00:07 
CrawlDb update: starting at 2014-01-11 16:30:22 
CrawlDb update: db: data/crawldb 
CrawlDb update: segments: [data/segments/20140111162513] 
CrawlDb update: additions allowed: true 
CrawlDb update: URL normalizing: true 
CrawlDb update: URL filtering: true 
CrawlDb update: 404 purging: false 
CrawlDb update: Merging segment data into db. 
CrawlDb update: finished at 2014-01-11 16:30:35, elapsed: 00:00:13 
LinkDb: starting at 2014-01-11 16:30:35 
LinkDb: linkdb: data/linkdb 
LinkDb: URL normalize: true 
LinkDb: URL filter: true 
LinkDb: internal links will be ignored. 
LinkDb: adding segment: file:/home/hu/data/nutch/release-1.6/runtime/local/data/segments/20140111162237 
LinkDb: adding segment: file:/home/hu/data/nutch/release-1.6/runtime/local/data/segments/20140111162320 
LinkDb: adding segment: file:/home/hu/data/nutch/release-1.6/runtime/local/data/segments/20140111162513 
LinkDb: finished at 2014-01-11 16:30:48, elapsed: 00:00:13 
crawl finished: data

 

Now the contents of the linkdb can be inspected:

hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ bin/nutch readlinkdb ./data/linkdb -dump crawl_tianya_out_linkdb 
LinkDb dump: starting at 2014-01-11 16:39:42 
LinkDb dump: db: ./data/linkdb 
LinkDb dump: finished at 2014-01-11 16:39:49, elapsed: 00:00:07 
hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ ls ./crawl_tianya_out_linkdb/ 
part-00000 
hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ head -n 10 ./crawl_tianya_out_linkdb/part-00000 
http://100w.tianya.cn/    Inlinks: 
fromUrl: http://star.tianya.cn/ anchor: [2012第一美差] 
fromUrl: http://star.tianya.cn/ anchor: 2013第一美差

http://aimin_001.blog.tianya.cn/    Inlinks: 
fromUrl: http://blog.tianya.cn/blog/mingbo anchor: 長沙艾敏 
fromUrl: http://blog.tianya.cn/ anchor: 長沙艾敏

http://alice.tianya.cn/    Inlinks: 
fromUrl: http://bj.tianya.cn/ anchor: 
hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$

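Some pages have several inlinks. In general, the more inlinks a page has, the more important it is considered, and this feeds directly into its score. A site's home page, for example, usually has many inlinks.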
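To see which pages collect the most inlinks, the linkdb dump can be grouped by target URL. This is a minimal sketch, assuming the `<url> Inlinks:` / `fromUrl: ... anchor: ...` layout shown in the transcript above; `parse_linkdb_dump` is an illustrative name, and the sample is the `head -n 10` output from the post.

```python
import re
from collections import defaultdict

SAMPLE_LINKDB = """\
http://100w.tianya.cn/    Inlinks:
fromUrl: http://star.tianya.cn/ anchor: [2012第一美差]
fromUrl: http://star.tianya.cn/ anchor: 2013第一美差

http://aimin_001.blog.tianya.cn/    Inlinks:
fromUrl: http://blog.tianya.cn/blog/mingbo anchor: 長沙艾敏
fromUrl: http://blog.tianya.cn/ anchor: 長沙艾敏

http://alice.tianya.cn/    Inlinks:
fromUrl: http://bj.tianya.cn/ anchor:
"""

TARGET = re.compile(r"^(\S+)\s+Inlinks:\s*$")
INLINK = re.compile(r"^\s*fromUrl:\s+(\S+)\s+anchor:\s*(.*)$")


def parse_linkdb_dump(text):
    """Return {target_url: [(from_url, anchor_text), ...]}."""
    inlinks = defaultdict(list)
    current = None
    for line in text.splitlines():
        t = TARGET.match(line)
        if t:
            current = t.group(1)
            continue
        m = INLINK.match(line)
        if m and current is not None:
            inlinks[current].append((m.group(1), m.group(2).strip()))
    return dict(inlinks)


inlinks = parse_linkdb_dump(SAMPLE_LINKDB)
# Number of distinct source pages per target, which is often more
# informative than the raw inlink count.
distinct_sources = {url: len({src for src, _ in links})
                    for url, links in inlinks.items()}
for url, n in sorted(distinct_sources.items(), key=lambda kv: -kv[1]):
    print(n, url)
```

Note that in this sample http://100w.tianya.cn/ has two inlinks but both come from the same page, so counting distinct sources gives a different picture than counting lines.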

 

 

 

Viewing other information:

 

1. Information about the crawl run can be pulled from the log as needed

hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ cat nohup.out | grep elapsed 
Injector: finished at 2013-12-08 21:10:53, elapsed: 00:00:14 
Generator: finished at 2013-12-08 21:11:08, elapsed: 00:00:15 
Injector: finished at 2013-12-08 21:37:15, elapsed: 00:00:17 
Generator: finished at 2013-12-08 21:37:30, elapsed: 00:00:15 
Fetcher: finished at 2013-12-08 21:37:37, elapsed: 00:00:07 
ParseSegment: finished at 2013-12-08 21:37:45, elapsed: 00:00:07 
CrawlDb update: finished at 2013-12-08 21:37:58, elapsed: 00:00:13 
Generator: finished at 2013-12-08 21:38:13, elapsed: 00:00:15 
Fetcher: finished at 2013-12-08 21:39:29, elapsed: 00:01:16 
ParseSegment: finished at 2013-12-08 21:39:36, elapsed: 00:00:07 
CrawlDb update: finished at 2013-12-08 21:39:49, elapsed: 00:00:13 
Generator: finished at 2013-12-08 21:40:04, elapsed: 00:00:15 
Fetcher: finished at 2013-12-08 21:42:17, elapsed: 00:02:13 
ParseSegment: finished at 2013-12-08 21:42:24, elapsed: 00:00:07 
CrawlDb update: finished at 2013-12-08 21:42:37, elapsed: 00:00:13 
Injector: finished at 2014-01-11 16:22:29, elapsed: 00:00:14 
Generator: finished at 2014-01-11 16:22:45, elapsed: 00:00:15 
Fetcher: finished at 2014-01-11 16:22:52, elapsed: 00:00:07 
ParseSegment: finished at 2014-01-11 16:22:59, elapsed: 00:00:07 
CrawlDb update: finished at 2014-01-11 16:23:12, elapsed: 00:00:13 
Generator: finished at 2014-01-11 16:23:27, elapsed: 00:00:15 
Fetcher: finished at 2014-01-11 16:24:48, elapsed: 00:01:21 
ParseSegment: finished at 2014-01-11 16:24:55, elapsed: 00:00:07 
CrawlDb update: finished at 2014-01-11 16:25:05, elapsed: 00:00:10 
Generator: finished at 2014-01-11 16:25:20, elapsed: 00:00:15 
Fetcher: finished at 2014-01-11 16:30:15, elapsed: 00:04:54 
ParseSegment: finished at 2014-01-11 16:30:22, elapsed: 00:00:07 
CrawlDb update: finished at 2014-01-11 16:30:35, elapsed: 00:00:13 
LinkDb: finished at 2014-01-11 16:30:48, elapsed: 00:00:13
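The grep output makes it easy to see where the time goes. The sketch below sums the elapsed time per phase, assuming every line follows the `<Phase>: finished at <timestamp>, elapsed: HH:MM:SS` pattern shown above; `elapsed_by_phase` is a hypothetical helper and the sample covers only the 2014-01-11 run.

```python
import re
from collections import defaultdict
from datetime import timedelta

SAMPLE_LOG = """\
Injector: finished at 2014-01-11 16:22:29, elapsed: 00:00:14
Generator: finished at 2014-01-11 16:22:45, elapsed: 00:00:15
Fetcher: finished at 2014-01-11 16:22:52, elapsed: 00:00:07
ParseSegment: finished at 2014-01-11 16:22:59, elapsed: 00:00:07
CrawlDb update: finished at 2014-01-11 16:23:12, elapsed: 00:00:13
Generator: finished at 2014-01-11 16:23:27, elapsed: 00:00:15
Fetcher: finished at 2014-01-11 16:24:48, elapsed: 00:01:21
ParseSegment: finished at 2014-01-11 16:24:55, elapsed: 00:00:07
CrawlDb update: finished at 2014-01-11 16:25:05, elapsed: 00:00:10
Generator: finished at 2014-01-11 16:25:20, elapsed: 00:00:15
Fetcher: finished at 2014-01-11 16:30:15, elapsed: 00:04:54
ParseSegment: finished at 2014-01-11 16:30:22, elapsed: 00:00:07
CrawlDb update: finished at 2014-01-11 16:30:35, elapsed: 00:00:13
LinkDb: finished at 2014-01-11 16:30:48, elapsed: 00:00:13
"""

FINISHED = re.compile(
    r"^(.+?): finished at .+?, elapsed: (\d+):(\d+):(\d+)\s*$")


def elapsed_by_phase(log_text):
    """Sum the `elapsed` values per phase name."""
    totals = defaultdict(timedelta)
    for line in log_text.splitlines():
        m = FINISHED.match(line)
        if m:
            h, mi, s = (int(x) for x in m.group(2, 3, 4))
            totals[m.group(1)] += timedelta(hours=h, minutes=mi, seconds=s)
    return dict(totals)


totals = elapsed_by_phase(SAMPLE_LOG)
for phase, spent in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{phase:15s} {spent}")
```

For this run the Fetcher dominates, and its share grows with each round as the fetch list gets longer.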

 

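2. Viewing content under the segments directory. As mentioned above, content holds the page source as it was fetched. It is stored in binary form and cannot be read directly, but Nutch provides a way to view it: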

hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ bin/nutch readseg 
Usage: SegmentReader (-dump ... | -list ... | -get ...) [general options]

* General options: 
    -nocontent    ignore content directory 
    -nofetch    ignore crawl_fetch directory 
    -nogenerate    ignore crawl_generate directory 
    -noparse    ignore crawl_parse directory 
    -noparsedata    ignore parse_data directory 
    -noparsetext    ignore parse_text directory

* SegmentReader -dump <segment_dir> <output> [general options] 
  Dumps content of a <segment_dir> as a text file to <output>.

    <segment_dir>    name of the segment directory. 
    <output>    name of the (non-existent) output directory.

* SegmentReader -list (<segment_dir1> ... | -dir <segments>) [general options] 
  List a synopsis of segments in specified directories, or all segments in 
  a directory <segments>, and print it on System.out

    <segment_dir1> ...    list of segment directories to process 
    -dir <segments>        directory that contains multiple segments

* SegmentReader -get <segment_dir> <keyValue> [general options] 
  Get a specified record from a segment, and print it on System.out.

    <segment_dir>    name of the segment directory. 
    <keyValue>    value of the key (url). 
        Note: put double-quotes around strings with spaces.

 

Viewing content:

content contains the raw content retrieved from each URL.
hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ bin/nutch readseg -dump ./data/segments/20140111162237 ./data/crawl_tianya_seg_content -nofetch -nogenerate -noparse -noparsedata -noparsetext 
SegmentReader: dump segment: data/segments/20140111162237 
SegmentReader: done 
hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ ls ./data/crawl_tianya_seg_content/ 
dump       .dump.crc  
hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ head -n 50 ./data/crawl_tianya_seg_content/dump

Recno:: 0 
URL:: http://blog.tianya.cn/

Content:: 
Version: -1 
url: http://blog.tianya.cn/ 
base: http://blog.tianya.cn/ 
contentType: text/html 
metadata: Date=Sat, 11 Jan 2014 08:22:46 GMT Vary=Accept-Encoding Expires=Thu, 01 Nov 2012 10:00:00 GMT Content-Encoding=gzip nutch.crawl.score=1.0 _fst_=33 nutch.segment.name=20140111162237 Content-Type=text/html; charset=UTF-8 Connection=close Server=nginx Cache-Control=no-cache Pragma=no-cache 
Content:

<!DOCTYPE HTML> 
<html> 
<head> 
<meta charset="utf-8"> 
<title>天涯博客_有見識的人都在此</title> 
<meta name="keywords" content="天涯,博客,天涯博客,天涯社區,天涯論壇,意見領袖" /> 
<meta name="description" content="天涯博客是天涯社區開辦的獨立博客平台,這里可以表達網民立場,聚集意見領袖,眾多草根精英以他們的觀點影響社會的進程。天涯博客,有見識的人都在此!" />

<link href="http://static.tianyaui.com/global/ty/TY.css" rel="stylesheet" type="text/css" /> 
<link href="http://static.tianyaui.com/global/blog/web/static/css/blog_56de4ad.css" rel="stylesheet" type="text/css" /> 
<link rel="shortcut icon" href="http://static.tianyaui.com/favicon.ico" type="image/vnd.microsoft.icon" /> 
<script type="text/javascript" charset="utf-8" src="http://static.tianyaui.com/global/ty/TY.js"></script> 
<!--[if lt IE 7]> 
  <script src="http://static.tianyaui.com/global/ty/util/image/DD_belatedPNG_0.0.8a.js?v=2013101509"type="text/javascript"></script> 
<![endif]--> 
</head> 
<body> 
<div id="huebox" > 
    
<script type="text/javascript" charset="utf-8">TY.loader("TY.ui.nav",function(){TY.ui.nav.init ({app_str:'blog',topNavWidth: 1000,showBottomNav:false});});</script>

<div id="blogdoc" class="blogdoc blogindex"> 
    <div id="hd"></div> 
    <div id="bd" class="layout-lmr clearfix"> 
        <div id="left"> 
            
            
<div class="sub-nav left-mod"> 
    <ul class="text-list-2"> 
        <li class="curr"><a class="ico-1" href="http://blog.tianya.cn/">博客首頁</a></li> 
        <li class=""><a href="/blog/society">社會民生</a></li> 
        <li class=""><a href="/blog/international">國際觀察</a></li> 
        <li class=""><a href="/blog/ent">娛樂</a></li> 
        <li class=""><a href="/blog/sports">體育</a></li> 
        <li class=""><a href="/blog/culture">文化</a></li> 
        <li class=""><a href="/blog/history">歷史</a></li> 
        <li class=""><a href="/blog/life">生活</a></li> 
        <li class=""><a href="/blog/emotion">情感</a></li> 
hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$

 

The other directories, such as crawl_fetch and parse_data, can be viewed in the same way.

Viewing crawl_fetch:

crawl_fetch contains the status of fetching each URL.

hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ bin/nutch readseg -dump ./data/segments/20140111162237  ./data/crawl_tianya_seg_fetch -nocontent -nogenerate -noparse -noparsedata -noparsetext 
SegmentReader: dump segment: data/segments/20140111162237 
SegmentReader: done 
hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ head -n 50 ./data/crawl_tianya_seg_fetch/dump

Recno:: 0 
URL:: http://blog.tianya.cn/

CrawlDatum:: 
Version: 7 
Status: 33 (fetch_success) 
Fetch time: Sat Jan 11 16:22:46 CST 2014 
Modified time: Thu Jan 01 08:00:00 CST 1970 
Retries since fetch: 0 
Retry interval: 2592000 seconds (30 days) 
Score: 1.0 
Signature: null 
Metadata: _ngt_: 1389428549880Content-Type: text/html_pst_: success(1), lastModified=0

hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$

 

Viewing crawl_generate:

crawl_generate names the set of URLs to be fetched.

hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ bin/nutch readseg -dump ./data/segments/20140111162237  ./data/crawl_tianya_seg_generate -nocontent -nofetch -noparse -noparsedata -noparsetext 
SegmentReader: dump segment: data/segments/20140111162237 
SegmentReader: done 
hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ head -n 50 ./data/crawl_tianya_seg_generate/dump

Recno:: 0 
URL:: http://blog.tianya.cn/

CrawlDatum:: 
Version: 7 
Status: 1 (db_unfetched) 
Fetch time: Sat Jan 11 16:22:15 CST 2014 
Modified time: Thu Jan 01 08:00:00 CST 1970 
Retries since fetch: 0 
Retry interval: 2592000 seconds (30 days) 
Score: 1.0 
Signature: null 
Metadata: _ngt_: 1389428549880

 

Viewing crawl_parse:

crawl_parse contains the outlink URLs, used to update the crawldb.

hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ bin/nutch readseg -dump ./data/segments/20140111162237  ./data/crawl_tianya_seg_parse -nofetch -nogenerate -nocontent -noparsedata -noparsetext 
SegmentReader: dump segment: data/segments/20140111162237 
SegmentReader: done 
hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ head -n 50 ./data/crawl_tianya_seg_parse/dump

Recno:: 0 
URL:: http://aimin_001.blog.tianya.cn/

CrawlDatum:: 
Version: 7 
Status: 67 (linked) 
Fetch time: Sat Jan 11 16:22:55 CST 2014 
Modified time: Thu Jan 01 08:00:00 CST 1970 
Retries since fetch: 0 
Retry interval: 2592000 seconds (30 days) 
Score: 0.016949153 
Signature: null 
Metadata:


Recno:: 1 
URL:: http://anger.blog.tianya.cn/

CrawlDatum:: 
Version: 7 
Status: 67 (linked) 
Fetch time: Sat Jan 11 16:22:55 CST 2014 
Modified time: Thu Jan 01 08:00:00 CST 1970 
Retries since fetch: 0 
Retry interval: 2592000 seconds (30 days) 
Score: 0.016949153 
Signature: null 
Metadata:

....

 

Viewing parse_data:

parse_data contains the outlinks and metadata parsed from each URL.

hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ bin/nutch readseg -dump ./data/segments/20140111162237  ./data/crawl_tianya_seg_data -nofetch -nogenerate -nocontent -noparse -noparsetext 
SegmentReader: dump segment: data/segments/20140111162237 
SegmentReader: done 
hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ head -n 50 ./data/crawl_tianya_seg_data/dump

Recno:: 0 
URL:: http://blog.tianya.cn/

ParseData:: 
Version: 5 
Status: success(1,0) 
Title: 天涯博客_有見識的人都在此 
Outlinks: 59 
  outlink: toUrl: http://blog.tianya.cn/blog/society anchor: 社會民生 
  outlink: toUrl: http://blog.tianya.cn/blog/international anchor: 國際觀察 
  outlink: toUrl: http://blog.tianya.cn/blog/ent anchor: 娛樂 
  outlink: toUrl: http://blog.tianya.cn/blog/sports anchor: 體育 
  outlink: toUrl: http://blog.tianya.cn/blog/culture anchor: 文化 
  outlink: toUrl: http://blog.tianya.cn/blog/history anchor: 歷史 
  outlink: toUrl: http://blog.tianya.cn/blog/life anchor: 生活 
  outlink: toUrl: http://blog.tianya.cn/blog/emotion anchor: 情感 
  outlink: toUrl: http://blog.tianya.cn/blog/finance anchor: 財經 
  outlink: toUrl: http://blog.tianya.cn/blog/stock anchor: 股市 
  outlink: toUrl: http://blog.tianya.cn/blog/food anchor: 美食 
  outlink: toUrl: http://blog.tianya.cn/blog/travel anchor: 旅游 
  outlink: toUrl: http://blog.tianya.cn/blog/newPush anchor: 最新博文 
  outlink: toUrl: http://blog.tianya.cn/blog/mingbo anchor: 天涯名博 
  outlink: toUrl: http://blog.tianya.cn/blog/daren anchor: 博客達人 
  outlink: toUrl: http://www.tianya.cn/mobile anchor: 
  outlink: toUrl: http://bbs.tianya.cn/post-1018-1157-1.shtml anchor: 天涯“2013年度十大深度影響力博客”名單 
  outlink: toUrl: http://jingyibaobei.blog.tianya.cn/ anchor: 煙花少爺 
  outlink: toUrl: http://lljjasmine.blog.tianya.cn/ anchor: 尋夢的冰蝶

....

 

Viewing parse_text:

parse_text contains the parsed text of each URL.

hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ bin/nutch readseg -dump ./data/segments/20140111162237  ./data/crawl_tianya_seg_text -nofetch -nogenerate -nocontent -noparse -noparsedata 
SegmentReader: dump segment: data/segments/20140111162237 
SegmentReader: done 
hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ head -n 50 ./data/crawl_tianya_seg_text/dump

Recno:: 0 
URL:: http://blog.tianya.cn/

ParseText:: 
天涯博客_有見識的人都在此 博客首頁 社會民生 國際觀察 娛樂 體育 文化 歷史 生活 情感 財經 股市 美食 旅游 最新博文 天涯名博 博客達人 博客總排行 01 等待溫暖的小狐狸 44887595 02 潘文偉 34654676 03 travelisliving 30676532 04 股市掘金 28472831 05 crystalkitty 26283927 06 yuwenyufen 24880887 07 水莫然 24681174 08 李澤輝 22691445 09 鍾巍巍 19226129 10 別境 17752691 11 微笑的說我很幸 15912882 12 尤宇 15530802 13 sundaes 14961321 14 鄭渝川 14219498 15 黑花黃 13174656 博文排行 01 任志強戳穿“央視十宗罪”都 02 野雲先生:錢眼里的文化(5 03 是美女博士征男友還是媒體博 04 黃牛永遠走在時代的最前沿 05 “與女優度春宵”怎成員工年 06 如何看待對張藝謀罰款748萬 07 女保姆酒后色誘我上床被妻撞 08 年過不惑的男人為何對婚姻也 09 明代變態官員囚多名尼姑做性 10 女人不肯承認的20個秘密 社會排行 國際排行 01 風青楊:章子怡“七億陪睡案 02 潘金雲和她的腦癱孩子們。 03 人民大學前校長紀寶成腐敗之 04 小學語文課本配圖錯誤不是小 05 閑聊“北京地鐵要漲價” 06 “高壓”整治火患之后,還該 07 警惕父母誤導孩子的十種不良 08 一代名伶紅線女為什么如此紅 09 黎明:應明令禁止官員技偵發 10 官二代富二代的好運氣不能獨 01 【環球熱點】如此奢華—看了 02 阿基諾的民,阿基諾的心,阿 03 “中國向菲律賓捐款10萬美元 04 美國法律界:對青少年犯罪的 05 一語中的:諾貝爾獎得主銳評 06 300萬元保證金騙到武漢公司1 07 亂而取之的智慧 08 中國連宣泄憤怒都有人“代表 09 世界啊,請醒醒吧,都被美元 10 反腐利器呼之欲出,貪腐官員 娛樂排行 體育排行 01 2013網絡票選新宅男女神榜單 02 從《千金歸來》看中國電視劇 03 汪峰自稱是好爸爸時大家都笑 04 黃聖依稱楊子是靠山打了誰的 05 汪峰連鎖型劣跡被爆遭六六嘲 06 舒淇深V禮服到肚臍令人窒息 07 張柏芝交老外新歡照曝光(圖 08 吳奇隆公開戀情眾網友送祝福 09 獨家:趙本山愛女妞妞練功美 10 “幫汪峰上頭條”背后的注意 01 道歉信和危機公關 02 【環球熱點】鳥人(視頻) 03 曼聯宣布維迪奇已出院 04 哈登,別讓假摔毀了形象。。。。。。

 

Everything can also be dumped into a single file:

hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ bin/nutch readseg -dump ./data/segments/20140111162237  ./data/segments/20140111162237_dump 
SegmentReader: dump segment: data/segments/20140111162237 
SegmentReader: done 
hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ less ./data/segments/20140111162237_dump/dump

 

3. Listing segment statistics with -list and retrieving a single record with -get

hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ bin/nutch readseg -list -dir data/segments 
NAME GENERATED FETCHER START FETCHER END FETCHED PARSED 
20140111162237 1 2014-01-11T16:22:46 2014-01-11T16:22:46 1 1 
20140111162320 57 2014-01-11T16:23:27 2014-01-11T16:24:43 58 19 
20140111162513 135 2014-01-11T16:25:21 2014-01-11T16:30:09 140 102

hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ bin/nutch readseg -list  data/segments/20140111162320/ 
NAME        GENERATED    FETCHER START        FETCHER END        FETCHED    PARSED 
20140111162320    57        2014-01-11T16:23:27    2014-01-11T16:24:43    58    19

hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ bin/nutch readseg -get  data/segments/20140111162513 http://100w.tianya.cn/ 
SegmentReader: get 'http://100w.tianya.cn/' 
Crawl Parse:: 
Version: 7 
Status: 67 (linked) 
Fetch time: Sat Jan 11 16:30:18 CST 2014 
Modified time: Thu Jan 01 08:00:00 CST 1970 
Retries since fetch: 0 
Retry interval: 2592000 seconds (30 days) 
Score: 1.8224896E-6 
Signature: null 
Metadata:
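The `-list` synopsis shown earlier in this step can also be read into rows, which makes it easy to compare rounds. This is a small sketch assuming the six-column layout of that output (two of the header labels contain a space, so only data rows split into exactly six fields); `parse_segment_list` is an illustrative name.

```python
from datetime import datetime

SAMPLE_LIST = """\
NAME GENERATED FETCHER START FETCHER END FETCHED PARSED
20140111162237 1 2014-01-11T16:22:46 2014-01-11T16:22:46 1 1
20140111162320 57 2014-01-11T16:23:27 2014-01-11T16:24:43 58 19
20140111162513 135 2014-01-11T16:25:21 2014-01-11T16:30:09 140 102
"""


def parse_segment_list(text):
    """Parse `readseg -list` rows into dicts; header lines are skipped."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 6 or not parts[0].isdigit():
            continue  # header or unrelated line
        name, generated, start, end, fetched, parsed = parts
        duration = datetime.fromisoformat(end) - datetime.fromisoformat(start)
        rows.append({
            "name": name,
            "generated": int(generated),
            "fetch_seconds": duration.total_seconds(),
            "fetched": int(fetched),
            "parsed": int(parsed),
        })
    return rows


rows = parse_segment_list(SAMPLE_LIST)
for row in rows:
    print(row["name"], row["fetched"], row["parsed"], row["fetch_seconds"])
```

The rows show, for instance, that the second round fetched 58 entries but parsed only 19 of them.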

 

 

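4. Viewing URLs sorted by score with readdb and the -topN option

(1). Here the conditions are: the top 10 URLs, with a minimum score of 1 (records scoring below the minimum are skipped)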

hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ ./bin/nutch readdb ./data/crawldb -topN 10 ./data/crawldb_topN 1 
CrawlDb topN: starting (topN=10, min=1.0) 
CrawlDb db: ./data/crawldb 
CrawlDb topN: collecting topN scores. 
CrawlDb topN: done 
hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ cat ./data/crawldb_topN/part-00000 
1.0140933    http://blog.tianya.cn/ 
hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$

 

(2). No minimum score, top 10 URLs

hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ ./bin/nutch readdb ./data/crawldb -topN 10 ./data/crawldb_topN_all_score 
CrawlDb topN: starting (topN=10, min=0.0) 
CrawlDb db: ./data/crawldb 
CrawlDb topN: collecting topN scores. 
CrawlDb topN: done 
hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ cat ./data/crawldb_topN_all_score/part-00000 
1.0140933    http://blog.tianya.cn/ 
0.046008706    http://blog.tianya.cn/blog/society 
0.046008706    http://blog.tianya.cn/blog/international 
0.030586869    http://blog.tianya.cn/blog/mingbo 
0.030586869    http://blog.tianya.cn/blog/daren 
0.030330064    http://www.tianya.cn/mobile 
0.029951613    http://blog.tianya.cn/blog/culture 
0.029951613    http://blog.tianya.cn/blog/history 
0.029951613    http://blog.tianya.cn/blog/life 
0.029951613    http://blog.tianya.cn/blog/stock 
hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$

