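Building a simple search engine with Nutch + Tomcat on Ubuntu

A quick walkthrough for setting up a basic search engine

My setup:

Nutch: 1.2

Tomcat: 7.0.57

1 Nutch setup

Modify the Nutch configuration

1.1 Edit conf/nutch-site.xml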

        
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    <!-- Put site-specific property overrides in this file. -->

    <configuration>

        <!--property>
        <name>storage.data.store.class</name>
        <value>org.apache.gora.hbase.store.HBaseStore</value>
        <description>Default class for storing data</description>
        </property>
        <property>
        <name>http.agent.name</name>
        <value>xxx0624-ThinkPad-Edge</value>
        </property-->

        <property>
            <name>http.agent.name</name>
            <value>nutch1.0</value>
        </property>

        <property>
            <name>plugin.folders</name>
            <value>./plugins</value>
        </property>

    </configuration>


1.2 Edit conf/crawl-urlfilter.txt

      
    # accept hosts in MY.DOMAIN.NAME
    +^http://([a-z0-9]*\.)*sohu.com/

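Find this rule in the file and change it. I use sohu as the example here: the rule accepts only http URLs whose host is sohu.com or one of its subdomains.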
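To get a feel for which URLs such a rule lets through, the expression can be tried outside Nutch. The sketch below is a hypothetical helper (UrlFilterSketch is not a Nutch class). It assumes the rule is applied as a regex search anchored only by the ^ in the pattern itself, and it copies the expression verbatim, including the unescaped dot in sohu.com.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: it only replays the regex from the rule above.
class UrlFilterSketch {
    // Same expression as the "+" rule in crawl-urlfilter.txt, without the leading "+".
    static final Pattern ACCEPT = Pattern.compile("^http://([a-z0-9]*\\.)*sohu.com/");

    // True when the pattern is found in the URL; "^" pins the match to the start.
    static boolean accepted(String url) {
        return ACCEPT.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.sohu.com/",
            "http://news.sohu.com/a/1.html",
            "http://www.example.com/",
            "https://www.sohu.com/"
        };
        for (String url : urls) {
            System.out.println(url + " -> " + accepted(url));
        }
    }
}
```

Because the dot before com is not escaped, a host such as www.sohuxcom would also pass. Writing sohu\.com in the rule tightens it.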

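1.3 Add a seed folder

In the Nutch directory, mkdir a new folder named urls, then create an empty text file named urls.txt inside it.

Put the addresses to crawl into urls.txt, for example http://www.sohu.com/

1.4 Start the crawl

Command:

      bin/nutch crawl urls/urls.txt -dir crawled -depth 5 -threads 5 -topN 200

crawled is the directory where the crawl results are stored. -depth 5 follows links up to five levels from the seed, -threads 5 fetches with five threads, and -topN 200 caps each level at 200 pages. When the crawl finishes, five folders have been generated under crawled: crawldb, index, indexes, linkdb, segments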
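Before pointing Tomcat at the crawl output, it can help to confirm that all five folders are really there. The following is a minimal sketch under that assumption only: CrawlOutputCheckSketch is a made-up name, and the check looks at folder names, not at whether the index is usable.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reports which of the five expected folders are absent.
class CrawlOutputCheckSketch {
    static final String[] EXPECTED = {"crawldb", "index", "indexes", "linkdb", "segments"};

    // Returns the names of expected subfolders that do not exist under crawlDir.
    static List<String> missing(File crawlDir) {
        List<String> result = new ArrayList<String>();
        for (String name : EXPECTED) {
            if (!new File(crawlDir, name).isDirectory()) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Defaults to the "crawled" folder used in the crawl command above.
        File crawlDir = new File(args.length > 0 ? args[0] : "crawled");
        List<String> missing = missing(crawlDir);
        System.out.println(missing.isEmpty()
                ? "crawl output looks complete"
                : "missing: " + missing);
    }
}
```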

2 Tomcat setup

2.1 Copy the war file produced by the Nutch build into Tomcat's webapps folder and start Tomcat. Then, in the nutch-1.2 folder that Tomcat unpacks, edit WEB-INF/classes/nutch-site.xml

    <property>
        <name>searcher.dir</name>
        <value>/home/xxx0624/nutch-1.2/crawled</value>
    </property>

This tells the web app where the crawled data is stored, i.e. the crawled folder from step 1.4.

2.2 Fixing garbled Chinese text

2.2.1 Edit Tomcat's conf/server.xml

      
    <Connector port="8080" protocol="HTTP/1.1"
               connectionTimeout="20000"
               redirectPort="8443"
               URIEncoding="UTF-8"
               useBodyEncodingForURI="true"
    />

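Add the URIEncoding and useBodyEncodingForURI attributes shown above, so that request URIs (including the search query) are decoded as UTF-8.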
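The reason this matters: unless URIEncoding is set, Tomcat 7 decodes the percent-escaped bytes of a request URI as ISO-8859-1, so a Chinese query sent as UTF-8 arrives as mojibake. The sketch below illustrates the effect with java.net.URLDecoder as a stand-in for the connector's decoding. That substitution is an assumption for illustration, since Tomcat uses its own decoder, and QueryDecodingSketch is a made-up name.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical illustration: the same escaped bytes decoded with two charsets.
class QueryDecodingSketch {
    // Decodes a percent-escaped string, interpreting the bytes in the given charset.
    static String decode(String raw, String charset) {
        try {
            return URLDecoder.decode(raw, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        // "\u641c\u72d0" (the word for Sohu) percent-encoded from its UTF-8 bytes.
        String raw = "%E6%90%9C%E7%8B%90";
        String asUtf8 = decode(raw, "UTF-8");
        String asLatin1 = decode(raw, "ISO-8859-1");
        System.out.println("UTF-8 length: " + asUtf8.length());
        System.out.println("ISO-8859-1 length: " + asLatin1.length());
    }
}
```

Decoded as UTF-8 the six bytes become the two intended characters. Decoded as ISO-8859-1 each byte becomes its own character, which is what shows up as garbled text in the search box.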

2.2.2 Edit nutch-1.2/cache.jsp

Find this part and change it as follows (the original lines are left in as comments):

        
    Metadata metaData = bean.getParseData(details).getContentMeta();
    // Read metadata through ParseData instead of the content metadata alone.
    ParseData parseData = bean.getParseData(details);
    String content = null;
    // String contentType = (String) metaData.get(Metadata.CONTENT_TYPE);
    String contentType = parseData.getMeta(Metadata.CONTENT_TYPE);
    if (contentType.startsWith("text/html")) {
      // FIXME : it's better to emit the original 'byte' sequence
      // with 'charset' set to the value of 'CharEncoding',
      // but I don't know how to emit 'byte sequence' in JSP.
      // out.getOutputStream().write(bean.getContent(details)) may work,
      // but I'm not sure.
      // String encoding = (String) metaData.get("CharEncodingForConversion");
      String encoding = parseData.getMeta("CharEncodingForConversion");
      if (encoding != null) {
        try {
          content = new String(bean.getContent(details), encoding);
        }
        catch (UnsupportedEncodingException e) {
          // fallback to windows-1252
          content = new String(bean.getContent(details), "windows-1252");
        }
      }
      else
        // No encoding recorded for the page: assume GBK instead of the platform default.
        content = new String(bean.getContent(details), "GBK");
        // content = new String(bean.getContent(details));
    }
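The decoding order in that JSP fragment can be tried in isolation. This is a minimal sketch of the same three-step choice (recorded encoding, then windows-1252 if the name is unsupported, then GBK when nothing was recorded). CachedPageDecodingSketch and its decode method are hypothetical names, and the byte array stands in for what bean.getContent(details) returns.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

// Hypothetical stand-alone version of the charset choice made in cache.jsp.
class CachedPageDecodingSketch {
    // raw: page bytes; encoding: the recorded charset name, or null if none.
    static String decode(byte[] raw, String encoding) {
        if (encoding != null) {
            try {
                return new String(raw, encoding);
            } catch (UnsupportedEncodingException e) {
                // Same fallback as the JSP: unknown charset names use windows-1252.
                return new String(raw, Charset.forName("windows-1252"));
            }
        }
        // No recorded encoding: assume GBK, as the modified JSP does.
        return new String(raw, Charset.forName("GBK"));
    }

    public static void main(String[] args) {
        String text = "\u641c\u72d0";
        byte[] gbkBytes = text.getBytes(Charset.forName("GBK"));
        // With no recorded encoding the GBK bytes round-trip to the same text.
        System.out.println(decode(gbkBytes, null).equals(text));
    }
}
```

The GBK default suits a crawl of mostly Chinese sites like the one here. For pages served in another encoding without a recorded charset, the same default would garble them, so it is a trade-off rather than a general fix.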


3 Try it out

Restart Tomcat.

Then open http://localhost:8080/nutch-1.2 in a browser.
