Compiling Nutch 2.2.1 with Ant on Ubuntu & Configuring Nutch 2.2.1


 

/×××××××××××××××××××××××××××××××××××××××××/

 

Author: xxx0624

 

HomePage: http://www.cnblogs.com/xxx0624/

 

/×××××××××××××××××××××××××××××××××××××××××/

 

 

 

Compiling Nutch 2.x with Ant

See: 1.    http://blog.javachen.com/2014/05/20/nutch-intro/

     2.    wiki.apache.org/nutch/Nutch2Tutorial

     3.    http://duguyiren3476.iteye.com/blog/2085973  (follow this post for the build process)

Prerequisite: Ant is already configured (http://www.cnblogs.com/xxx0624/p/4172277.html)

1. Download Nutch (for example, mine is apache-nutch-2.2.1-src.tar.gz)

Extract the archive, rename the extracted folder to nutch, then move it into /home
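
The extract, rename and move step can be sketched as a small shell helper. This is only a sketch: the function name install_nutch_src is hypothetical, and it assumes the archive unpacks into a single top-level directory.

```shell
# Hypothetical helper (not part of Nutch): unpack a source tarball
# and place its top-level directory at <parent>/nutch.
install_nutch_src() {
    tarball="$1"   # e.g. apache-nutch-2.2.1-src.tar.gz
    parent="$2"    # e.g. /home
    # Refuse to overwrite an existing nutch directory.
    if [ -e "$parent/nutch" ]; then
        echo "already exists: $parent/nutch" >&2
        return 1
    fi
    workdir="$(mktemp -d)" || return 1
    tar -xzf "$tarball" -C "$workdir" || return 1
    # Assumption: the archive contains a single top-level directory.
    extracted="$(find "$workdir" -mindepth 1 -maxdepth 1 -type d | head -n 1)"
    if [ -z "$extracted" ]; then
        return 1
    fi
    mv "$extracted" "$parent/nutch" || return 1
    rm -rf "$workdir"
}

# Usage (writing into /home normally requires root privileges):
#   install_nutch_src apache-nutch-2.2.1-src.tar.gz /home
```

Extracting into a temporary directory first avoids having to know the exact name of the folder inside the archive.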

 

2. Compile Nutch

  2.1 Preparation

  (1) The build may later fail with this error:

Trying to override old definition of task javac
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.
ivy-probe-antlib:
ivy-download:
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

   Cause: the corresponding jar is missing from Nutch.

  Solution:

    (a) Download sonar-ant-task-2.1.jar and place it directly in the nutch folder

    (b) Edit build.xml so that it picks up the new jar

<!-- Define the Sonar task if this hasn't been done in a common script -->
<taskdef uri="antlib:org.sonar.ant" resource="org/sonar/ant/antlib.xml">
    <classpath path="${ant.library.dir}" />
    <classpath path="${mysql.library.dir}" />
    <classpath><fileset dir="." includes="sonar*.jar" /></classpath> 
</taskdef>

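   (2) The build takes too long:

  Nutch is built with Ivy, so compiling takes a long time. If it takes too long, the following workaround can help.

  Edit ivy/ivysettings.xml and use

http://mirrors.ibiblio.org/maven2/

 in place of: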

http://repo1.maven.org/maven2/
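
The URL swap can also be scripted with sed instead of editing by hand. A minimal sketch follows; the function name use_ibiblio_mirror is hypothetical, and it assumes the repo1 URL appears in the file exactly as written above.

```shell
# Hypothetical helper: point Ivy at the ibiblio mirror.
# Keeps a backup copy next to the original file.
use_ibiblio_mirror() {
    settings="$1"   # e.g. ivy/ivysettings.xml
    cp "$settings" "$settings.bak" || return 1
    # Replace every occurrence of the repo1 URL with the mirror URL.
    sed 's#http://repo1\.maven\.org/maven2/#http://mirrors.ibiblio.org/maven2/#g' \
        "$settings.bak" > "$settings"
}

# Usage, from inside the nutch directory:
#   use_ibiblio_mirror ivy/ivysettings.xml
```

To undo the change, copy ivysettings.xml.bak back over ivysettings.xml.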

   2.2 Modify the Nutch configuration

  (1) Edit Nutch's conf/nutch-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

	<property>
		<name>storage.data.store.class</name>
		<value>org.apache.gora.hbase.store.HBaseStore</value>
		<description>Default class for storing data</description>
	</property>
	<property>
		<name>http.agent.name</name>
		<value>xxx0624-ThinkPad-Edge</value>
	</property>
</configuration>

   (2) Edit ivy/ivy.xml

<dependency org="org.apache.gora" name="gora-hbase" rev="0.3"  conf="*->default" />  <!-- remove the comment markers around this line so that it takes effect -->

   (3) Edit conf/gora.properties

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
#gora.sqlstore.jdbc.driver=org.hsqldb.jdbc.JDBCDriver
#gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/nutchtest  
#gora.sqlstore.jdbc.user=sa  
#gora.sqlstore.jdbc.password=  
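
Before building, a quick grep can confirm that the HBase store line is active rather than commented out. This is a sketch only, and the function name check_gora_store is hypothetical.

```shell
# Hypothetical helper: succeed only if gora.properties has an
# uncommented gora.datastore.default line set to HBaseStore.
check_gora_store() {
    props="$1"   # e.g. conf/gora.properties
    grep -Eq '^[[:space:]]*gora\.datastore\.default[[:space:]]*=[[:space:]]*org\.apache\.gora\.hbase\.store\.HBaseStore[[:space:]]*$' "$props"
}

# Usage, from inside the nutch directory:
#   check_gora_store conf/gora.properties && echo "HBaseStore is configured"
```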

   2.3 Compile Nutch (run the build from inside the nutch directory)

cd nutch
ant
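
Once ant finishes, one way to sanity-check the result is to look for the two launcher scripts used later in this post under runtime/local/bin. A minimal sketch, with a hypothetical function name check_nutch_build:

```shell
# Hypothetical helper: verify that the build produced the
# nutch and crawl scripts under runtime/local/bin.
check_nutch_build() {
    nutch_home="$1"   # e.g. /home/nutch
    for script in nutch crawl; do
        if [ ! -f "$nutch_home/runtime/local/bin/$script" ]; then
            echo "missing: $nutch_home/runtime/local/bin/$script" >&2
            return 1
        fi
    done
    echo "build output looks complete"
}

# Usage:
#   check_nutch_build /home/nutch
```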

   2.4 Directory layout after the build:

.
├── build
├── build.xml
├── build.xml~
├── CHANGES.txt
├── conf
├── default.properties
├── docs
├── ivy
├── lib
├── LICENSE.txt
├── NOTICE.txt
├── README.txt
├── runtime
├── sonar-ant-task-2.1.jar
└── src

7 directories, 8 files

 

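3. Modify the Nutch configuration files (all of this was already done in step 2)

    Nutch 2.x uses Gora for storage to access Cassandra, HBase, Accumulo, Avro and others, so the Gora properties must be specified in the files below.

 3.1 Edit conf/nutch-site.xml (already done in step 2)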

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>

  3.2 Edit ivy/ivy.xml (already done in step 2)

<!-- Uncomment this to use HBase as Gora backend. -->
<dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />

  3.3 Edit conf/gora.properties (already done in step 2)

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

 

/*****************************************************************************************************************************/

Configuring Nutch

(the nutch folder is already under /home)

1. Modify the system environment variables

sudo gedit /etc/profile

 // append the following, then run source /etc/profile (or log in again) so that it takes effect

#set nutch
export PATH=/home/nutch/runtime/local/bin:$PATH
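
If /etc/profile is sourced more than once, a plain export keeps prepending the same directory. The sketch below shows one way to guard against that. The function name prepend_path is hypothetical and not something Nutch provides.

```shell
# Hypothetical helper: prepend a directory to PATH only if it is
# not already one of the PATH entries.
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) ;;               # already present, nothing to do
        *) PATH="$1:$PATH" ;;
    esac
    export PATH
}

prepend_path /home/nutch/runtime/local/bin

# Afterwards, "command -v nutch" should print the script's path,
# provided the build has created runtime/local/bin/nutch.
```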

 

2. Test (./nutch and ./crawl in nutch/runtime/local/bin)

nutch
// output:
Usage: nutch COMMAND
where COMMAND is one of:
 inject		inject new urls into the database
 hostinject     creates or updates an existing host table from a text file
 generate 	generate new batches to fetch from crawl db
 fetch 		fetch URLs marked during generate
 parse 		parse URLs marked during fetch
 updatedb 	update web table after parsing
 updatehostdb   update host table after parsing
 readdb 	read/dump records from page database
 readhostdb     display entries from the hostDB
 elasticindex   run the elasticsearch indexer
 solrindex 	run the solr indexer on parsed batches
 solrdedup 	remove duplicates from solr
 parsechecker   check the parser for a given url
 indexchecker   check the indexing filters for a given url
 plugin 	load a plugin and run one of its classes main()
 nutchserver    run a (local) Nutch server on a user defined port
 junit         	runs the given JUnit test
 or
 CLASSNAME 	run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

 

crawl
// output:
Missing seedDir : crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>
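
The usage line shows that crawl expects a seed directory as its first argument. A minimal sketch of preparing one follows. The function name make_seed_dir, the file name seed.txt, and every value in the usage comment (directory, crawl ID, Solr URL, number of rounds) are assumptions for illustration only.

```shell
# Hypothetical helper: create a seed directory containing a text
# file with one URL per line.
make_seed_dir() {
    seed_dir="$1"
    shift
    if [ "$#" -eq 0 ]; then
        echo "usage: make_seed_dir <dir> <url>..." >&2
        return 1
    fi
    mkdir -p "$seed_dir" || return 1
    printf '%s\n' "$@" > "$seed_dir/seed.txt"
}

# Usage (all values are placeholders):
#   make_seed_dir urls http://nutch.apache.org/
#   crawl urls testCrawl http://localhost:8983/solr 2
```

With the HBase backend configured above, the crawl itself may still fail until HBase is set up and running.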

 

  Using Nutch together with HBase produces further errors, so those are recorded in a separate post: Integrating Nutch with HBase (http://www.cnblogs.com/xxx0624/p/4176199.html)

