Introduction
Apache Nutch is an open source web crawler written in Java. With it, we can discover the hyperlinks on web pages automatically, cut down on maintenance work such as checking for broken links, and create a copy of every visited page to search over. That is where Apache Solr comes in: Solr is an open source full-text search framework, and with it we can search the pages Nutch has visited. Luckily, integrating Nutch with Solr is fairly straightforward, as explained below.
Apache Nutch supports Solr out of the box, which greatly simplifies Nutch-Solr integration. It also removes the legacy dependence on Apache Tomcat for running the old Nutch web application and on Apache Lucene for indexing. Just download a binary release from http://www.apache.org/dyn/closer.cgi/nutch/
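---------------------------------------------------------------------------------------------------------------Translation (corrections welcome):
Apache Nutch is an open source web crawler project written in Java. By using it, we can find the hyperlinks on web pages in an automated way, which removes a lot of maintenance work, such as checking for dead links, and lets us create a copy of all visited pages for searching. This is where Apache Solr comes in: Solr is an open source full-text search framework, and with Solr we can search the pages Nutch has visited. Fortunately, integrating Nutch and Solr is very simple, as explained below.
Apache Nutch supports Solr out of the box, which makes integrating Nutch and Solr very simple. It also removes the legacy dependencies: there is no need to run the old Nutch web application on Apache Tomcat, nor to rely on Lucene for indexing. Please download a binary release of Nutch from http://www.apache.org/dyn/closer.cgi/nutch/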