Building a Simple Search Engine
My setup:
Nutch:1.2
Tomcat:7.0.57
1 Nutch setup
Modify the Nutch configuration.
1.1 Edit conf/nutch-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<!--property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>
<property>
  <name>http.agent.name</name>
  <value>xxx0624-ThinkPad-Edge</value>
</property-->

<property>
  <name>http.agent.name</name>
  <value>nutch1.0</value>
</property>

<property>
  <name>plugin.folders</name>
  <value>./plugins</value>
</property>

</configuration>
1.2 Edit conf/crawl-urlfilter.txt
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*sohu.com/
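Find this rule and modify it. I use sohu as the example: the rule restricts the crawl to pages whose host is sohu.com or one of its subdomains.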
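The rule can be tried out before starting a crawl. The sketch below uses grep -E as a stand-in for the Java regex engine that Nutch applies, an assumption that should hold for a pattern this simple. The function name url_allowed is hypothetical, and the leading + (the "accept" marker in crawl-urlfilter.txt) is dropped because it is not part of the regex.

```shell
#!/bin/sh
# Pattern from conf/crawl-urlfilter.txt, without the leading "+".
PATTERN='^http://([a-z0-9]*\.)*sohu.com/'

# url_allowed URL: succeeds if the URL matches the accept rule.
url_allowed() {
    printf '%s\n' "$1" | grep -Eq "$PATTERN"
}

for url in 'http://www.sohu.com/' \
           'http://news.sohu.com/a.html' \
           'http://www.example.com/'; do
    if url_allowed "$url"; then
        echo "accept $url"
    else
        echo "reject $url"
    fi
done
```

Note that the dot in sohu.com is not escaped, so it matches any character; writing sohu\.com would be stricter.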
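1.3 Add a seed folder
In the nutch directory, mkdir a new folder named urls, then create an empty text file named urls.txt inside it.
Write the addresses of the pages to crawl into urls.txt, for example http://www.sohu.com/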
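The two steps above can be sketched as a few shell commands. NUTCH_HOME is an assumed variable pointing at the Nutch 1.2 directory (it is not something Nutch itself requires); it defaults to the current directory here.

```shell
#!/bin/sh
# NUTCH_HOME is an assumed variable for the Nutch 1.2 directory;
# it falls back to the current directory.
NUTCH_HOME="${NUTCH_HOME:-.}"

# Create the seed folder (no error if it already exists).
mkdir -p "$NUTCH_HOME/urls"

# Write the seed list, one URL per line.
printf '%s\n' 'http://www.sohu.com/' > "$NUTCH_HOME/urls/urls.txt"

# Show what will be handed to the crawler.
cat "$NUTCH_HOME/urls/urls.txt"
```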
1.4 Start the crawl
Command:
bin/nutch crawl urls/urls.txt -dir crawled -depth 5 -threads 5 -topN 200
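Here -depth 5 is the number of crawl rounds (link depth from the seeds), -threads 5 the number of fetcher threads, and -topN 200 the maximum number of pages fetched per round. crawled is the directory where the crawl results are stored. When the crawl finishes, five folders are generated in it automatically: crawldb, index, indexes, linkdb, segments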
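A quick way to confirm that the crawl completed is to check for those five folders. This is a minimal sketch; the function name missing_crawl_dirs is hypothetical, and the directory name crawled matches the -dir argument used above.

```shell
#!/bin/sh
# missing_crawl_dirs DIR: print each expected subdirectory that is absent
# and return non-zero if any of them is missing.
missing_crawl_dirs() {
    status=0
    for name in crawldb index indexes linkdb segments; do
        if [ ! -d "$1/$name" ]; then
            echo "missing: $1/$name"
            status=1
        fi
    done
    return "$status"
}

if missing_crawl_dirs crawled; then
    echo "crawl output looks complete"
fi
```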
2 Tomcat setup
2.1 Copy the war file built from Nutch into Tomcat's webapps directory and start Tomcat. Then, in the nutch-1.2 folder that Tomcat unpacks, edit WEB-INF/classes/nutch-site.xml:
<property>
  <name>searcher.dir</name>
  <value>/home/xxx0624/nutch-1.2/crawled</value>
</property>
This sets the location of the crawled data that the search webapp reads.
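The copy step in 2.1 could be scripted roughly as follows. The war file name nutch-1.2.war and the fallback path /path/to/tomcat are assumptions to be adjusted to the local install, and deploy_war is a hypothetical helper; CATALINA_HOME is the usual variable for the Tomcat install directory.

```shell
#!/bin/sh
# deploy_war WAR WEBAPPS_DIR: copy the war into Tomcat's webapps directory.
deploy_war() {
    if [ ! -f "$1" ]; then
        echo "war not found: $1" >&2
        return 1
    fi
    cp "$1" "$2/"
}

# Assumed locations; adjust both to the local install.
WAR="${WAR:-nutch-1.2.war}"
WEBAPPS="${CATALINA_HOME:-/path/to/tomcat}/webapps"

if deploy_war "$WAR" "$WEBAPPS"; then
    echo "copied $WAR to $WEBAPPS; now start Tomcat with bin/startup.sh"
fi
```

Tomcat unpacks the war on startup, and the unpacked folder is where nutch-site.xml is then edited.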
2.2 Fixing garbled Chinese text
2.2.1 Edit Tomcat's configuration file conf/server.xml
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8"
           useBodyEncodingForURI="true" />
Add the URIEncoding and useBodyEncodingForURI attributes shown here.
2.2.2 Edit nutch-1.2/cache.jsp
Find this section:
Metadata metaData = bean.getParseData(details).getContentMeta();
ParseData ParseData = bean.getParseData(details);
String content = null;
// String contentType = (String) metaData.get(Metadata.CONTENT_TYPE);
String contentType = ParseData.getMeta(Metadata.CONTENT_TYPE);
if (contentType.startsWith("text/html")) {
  // FIXME : it's better to emit the original 'byte' sequence
  // with 'charset' set to the value of 'CharEncoding',
  // but I don't know how to emit 'byte sequence' in JSP.
  // out.getOutputStream().write(bean.getContent(details)) may work,
  // but I'm not sure.
  // String encoding = (String) metaData.get("CharEncodingForConversion");
  String encoding = ParseData.getMeta("CharEncodingForConversion");
  if (encoding != null) {
    try {
      content = new String(bean.getContent(details), encoding);
    }
    catch (UnsupportedEncodingException e) {
      // fallback to windows-1252
      content = new String(bean.getContent(details), "windows-1252");
    }
  }
  else
    content = new String(bean.getContent(details), "GBK");
    // content = new String(bean.getContent(details));
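The key change in this excerpt is the final fallback: when no encoding is recorded for a cached page, its bytes are decoded as GBK instead of with the platform default. The sketch below illustrates why the choice of charset matters, using iconv as a stand-in for Java's new String(bytes, charset); the helper names gbk_bytes and decode_as are hypothetical, and the script is assumed to be saved as UTF-8.

```shell
#!/bin/sh
# gbk_bytes TEXT: emit TEXT (given as UTF-8) re-encoded as GBK bytes,
# standing in for the raw content of a GBK-encoded page.
gbk_bytes() {
    printf '%s' "$1" | iconv -f UTF-8 -t GBK
}

# decode_as CHARSET: read raw bytes on stdin, interpret them as CHARSET,
# and print the result as UTF-8.
decode_as() {
    iconv -f "$1" -t UTF-8
}

sample='搜狐'

# Correct charset: the original text comes back.
echo "decoded as GBK:        $(gbk_bytes "$sample" | decode_as GBK)"

# Wrong charset: the same bytes turn into unrelated Latin characters.
echo "decoded as ISO-8859-1: $(gbk_bytes "$sample" | decode_as ISO-8859-1)"
```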
3 Try it out
Restart Tomcat.
Open this address in a browser: http://localhost:8080/nutch-1.2