Building a Simple Search Engine
My setup:
Nutch:1.2
Tomcat:7.0.57
1 Nutch setup
Modify the Nutch configuration.
1.1 Edit conf/nutch-site.xml
    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    <!-- Put site-specific property overrides in this file. -->

    <configuration>

    <!--property>
      <name>storage.data.store.class</name>
      <value>org.apache.gora.hbase.store.HBaseStore</value>
      <description>Default class for storing data</description>
    </property>
    <property>
      <name>http.agent.name</name>
      <value>xxx0624-ThinkPad-Edge</value>
    </property-->

    <property>
      <name>http.agent.name</name>
      <value>nutch1.0</value>
    </property>

    <property>
      <name>plugin.folders</name>
      <value>./plugins</value>
    </property>

    </configuration>
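1.2 Edit conf/crawl-urlfilter.txt
    # accept hosts in MY.DOMAIN.NAME
    +^http://([a-z0-9]*\.)*sohu.com/
Find this rule in the file and replace MY.DOMAIN.NAME with your own domain. I use sohu.com as the example, so the crawler only fetches pages whose host is sohu.com or one of its subdomains.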
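To see what the accept rule above lets through, the pattern can be tried against a few URLs. The following is a minimal sketch, not Nutch code: the class name SohuFilterDemo and the sample URLs are made up for illustration, and it assumes the filter treats the text after the leading + as a Java regular expression that only needs to match from the start of the URL. Note that the dot in sohu.com is not escaped, so it matches any character; writing sohu\.com would be stricter.

```java
import java.util.regex.Pattern;

final class SohuFilterDemo {
    // Pattern copied from crawl-urlfilter.txt without the leading '+'
    // (assumption: the '+' only marks the rule as an accept rule).
    static final Pattern ACCEPT = Pattern.compile("^http://([a-z0-9]*\\.)*sohu.com/");

    // Returns true when the URL would pass the accept rule.
    static boolean accepts(String url) {
        return ACCEPT.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.sohu.com/",
            "http://news.sohu.com/a/1.html",
            "http://www.example.com/",
            "https://www.sohu.com/"
        };
        for (String url : urls) {
            System.out.println(url + " -> " + accepts(url));
        }
    }
}
```

With these inputs the first two URLs print true and the last two print false: example.com is a different host, and the rule only covers http://, so the https URL is rejected as well.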
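1.3 Add the seed folder
In the nutch directory, create a new folder named urls (mkdir urls), then create an empty text file named urls.txt inside it.
Write the addresses of the pages to crawl into urls.txt, one per line, for example http://www.sohu.com/
1.4 Start crawling
Command:
    bin/nutch crawl urls/urls.txt -dir crawled -depth 5 -threads 5 -topN 200
crawled is the directory where the crawl results are stored. -depth 5 limits the crawl to five rounds of link following from the seed pages, -threads 5 sets the number of fetcher threads, and -topN 200 caps the number of pages fetched in each round. When the crawl finishes, the crawled directory contains five automatically generated folders: crawldb, index, indexes, linkdb, segments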
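A quick way to confirm that the crawl finished is to check that the five folders listed above exist. The following is a small hypothetical helper, not part of Nutch: the class name CrawlOutputCheck and the method missingFolders are invented for this sketch, and the default path crawled is simply the value passed to -dir in the command above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

final class CrawlOutputCheck {
    // Folder names taken from the list in the text above.
    static final String[] EXPECTED = {"crawldb", "index", "indexes", "linkdb", "segments"};

    // Returns the expected folder names that are not present under crawlDir.
    static List<String> missingFolders(Path crawlDir) {
        List<String> result = new ArrayList<>();
        for (String name : EXPECTED) {
            if (!Files.isDirectory(crawlDir.resolve(name))) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumption: run from the nutch directory, or pass the crawl directory as the first argument.
        Path dir = Paths.get(args.length > 0 ? args[0] : "crawled");
        List<String> missing = missingFolders(dir);
        if (missing.isEmpty()) {
            System.out.println("All five folders are present in " + dir);
        } else {
            System.out.println("Missing in " + dir + ": " + missing);
        }
    }
}
```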
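2 Tomcat setup
2.1 Copy the war file produced by the Nutch build into Tomcat's webapps directory and start Tomcat. Then, in the nutch-1.2 folder that Tomcat generates, edit WEB-INF/classes/nutch-site.xml
    <property>
      <name>searcher.dir</name>
      <value>/home/xxx0624/nutch-1.2/crawled</value>
    </property>
This sets the location of the crawled data that the search page reads.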
2.2 Fixing garbled Chinese text
2.2.1 Edit the Tomcat configuration file conf/server.xml
    <Connector port="8080" protocol="HTTP/1.1"
               connectionTimeout="20000"
               redirectPort="8443"
               URIEncoding="UTF-8"
               useBodyEncodingForURI="true"/>
Add the URIEncoding and useBodyEncodingForURI attributes to the existing Connector element.
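Why URIEncoding matters can be shown outside Tomcat. Tomcat 7 decodes the query string as ISO-8859-1 unless URIEncoding says otherwise, while browsers usually send Chinese search terms percent-encoded as UTF-8. The sketch below uses only the JDK's URLDecoder to mimic both cases; the class name UriEncodingDemo and the sample keyword are illustrative, and it does not call any Tomcat API.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class UriEncodingDemo {
    // Decodes a percent-encoded query value with the given charset name.
    static String decode(String rawQueryValue, String charsetName) {
        try {
            return URLDecoder.decode(rawQueryValue, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String keyword = "\u641c\u72d0";          // the two characters of "Sohu" in Chinese
        String raw = "%E6%90%9C%E7%8B%90";        // the same keyword, percent-encoded as UTF-8

        // Decoding with UTF-8 restores the keyword.
        System.out.println(keyword.equals(decode(raw, "UTF-8")));   // prints true

        // Decoding with ISO-8859-1 turns each of the six bytes into its own character.
        System.out.println(decode(raw, "ISO-8859-1").length());     // prints 6
    }
}
```

The second result is the garbled text that reaches the search page when the connector is left at its default, which is why the keyword no longer matches anything in the index.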
2.2.2 Edit nutch-1.2/cache.jsp
Find this part and change it as follows. The original lines are kept as comments, and the fallback encoding for pages without a declared charset becomes GBK:
    Metadata metaData = bean.getParseData(details).getContentMeta();
    ParseData parseData = bean.getParseData(details);
    String content = null;
    // String contentType = (String) metaData.get(Metadata.CONTENT_TYPE);
    String contentType = parseData.getMeta(Metadata.CONTENT_TYPE);
    if (contentType.startsWith("text/html")) {
      // FIXME : it's better to emit the original 'byte' sequence
      // with 'charset' set to the value of 'CharEncoding',
      // but I don't know how to emit 'byte sequence' in JSP.
      // out.getOutputStream().write(bean.getContent(details)) may work,
      // but I'm not sure.
      // String encoding = (String) metaData.get("CharEncodingForConversion");
      String encoding = parseData.getMeta("CharEncodingForConversion");
      if (encoding != null) {
        try {
          content = new String(bean.getContent(details), encoding);
        }
        catch (UnsupportedEncodingException e) {
          // fallback to windows-1252
          content = new String(bean.getContent(details), "windows-1252");
        }
      }
      else
        content = new String(bean.getContent(details), "GBK");
        // content = new String(bean.getContent(details));
    }
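The decoding logic in the fragment above can be tried in isolation. The sketch below is a standalone approximation, not the JSP itself: CacheDecodeDemo and its decode method are hypothetical names, the byte array stands in for bean.getContent(details), and the declared encoding stands in for the CharEncodingForConversion metadata value. It assumes the JDK in use ships the GBK charset.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

final class CacheDecodeDemo {
    // Mirrors the three branches of the JSP fragment:
    // declared encoding, windows-1252 if that name is unsupported, GBK if none is declared.
    static String decode(byte[] raw, String declaredEncoding) {
        if (declaredEncoding != null) {
            try {
                return new String(raw, declaredEncoding);
            } catch (UnsupportedEncodingException e) {
                return new String(raw, Charset.forName("windows-1252"));
            }
        }
        return new String(raw, Charset.forName("GBK"));
    }

    public static void main(String[] args) {
        String original = "\u641c\u72d0";                           // sample Chinese text
        byte[] gbkBytes = original.getBytes(Charset.forName("GBK")); // simulates a cached GBK page

        // No declared encoding: the GBK fallback restores the text.
        System.out.println(original.equals(decode(gbkBytes, null)));    // prints true

        // A wrong declared encoding still produces garbled text.
        System.out.println(original.equals(decode(gbkBytes, "UTF-8"))); // prints false
    }
}
```

The GBK fallback only helps for pages that are actually encoded in GBK or a compatible charset; a page in another encoding that declares no charset would still be displayed incorrectly.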
3 Try it out
Restart Tomcat.
Open http://localhost:8080/nutch-1.2 in a browser.

