To build Heritrix in Eclipse


This guide uses Heritrix 1.14.4, the latest version as of May 10, 2010.

1. Download the following two archives from http://sourceforge.net/projects/archive-crawler/:
heritrix-1.14.4.zip
heritrix-1.14.4-src.zip

2. Create a Java project in Eclipse, and extract each of the two archives:
heritrix-1.14.4.zip
heritrix-1.14.4-src.zip

3. From the extracted heritrix-1.14.4-src.zip, copy the three folders com, org, and st under src/java into the project's src folder.
4. From the extracted heritrix-1.14.4-src.zip, copy the conf folder under src into the project root directory.
5. From the extracted heritrix-1.14.4-src.zip, copy the lib folder into the project root directory.
6. From the extracted heritrix-1.14.4-src.zip, copy the file tlds-alpha-by-domain.txt from src/resources/org/archive/util into the project's org.archive.util package.
7. From the extracted heritrix-1.14.4.zip, copy the webapps folder into the project root directory.
If you use a folder name other than webapps, change Heritrix.java accordingly:

    /**
     * @throws IOException
     * @return Returns the directory under which reside the WAR files
     * we're to load into the servlet container.
     */
    public static File getWarsdir()
    throws IOException {
        return getSubDir("webapps");
    }
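
Steps 3 to 7 involve several manual copies, so it is easy to miss one. The following is a minimal sketch of a layout check, assuming the paths described in those steps. ProjectLayoutCheck and its EXPECTED list are hypothetical names introduced here for illustration; they are not part of Heritrix.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Heritrix): reports which expected project entries are missing. */
class ProjectLayoutCheck {
    // Paths relative to the Eclipse project root, as described in steps 3 to 7.
    static final String[] EXPECTED = {
        "src/com",
        "src/org",
        "src/st",
        "src/org/archive/util/tlds-alpha-by-domain.txt",
        "conf/heritrix.properties",
        "lib",
        "webapps"
    };

    /** Returns the expected entries that do not exist under the given project root. */
    static List<String> missing(File projectRoot) {
        List<String> result = new ArrayList<>();
        for (String relative : EXPECTED) {
            if (!new File(projectRoot, relative).exists()) {
                result.add(relative);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Pass the project root as the first argument, or run from the project root.
        File root = new File(args.length > 0 ? args[0] : ".");
        List<String> missing = missing(root);
        if (missing.isEmpty()) {
            System.out.println("Layout looks complete.");
        } else {
            System.out.println("Missing: " + missing);
        }
    }
}
```

Run it with the project root as the argument; any path it prints points to a copy step that still needs to be done.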


  


8. Edit the configuration file: open heritrix.properties in the conf folder and set the following.

    # Set the admin user name and password
    heritrix.cmdline.admin = admin:admin
    # Set the port
    heritrix.cmdline.port = 8080
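
To double-check what was written to the file, the two settings can be read back with java.util.Properties. This is only a sketch: HeritrixSettings is a hypothetical class made up for this example, and it assumes the admin value has the form user:password as shown above. It does not reproduce how Heritrix itself reads the file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical holder for the two settings edited in step 8 (not a Heritrix class). */
class HeritrixSettings {
    final String user;
    final String password;
    final int port;

    HeritrixSettings(String user, String password, int port) {
        this.user = user;
        this.password = password;
        this.port = port;
    }

    /** Parses properties text; assumes the admin value is written as user:password. */
    static HeritrixSettings fromText(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String admin = props.getProperty("heritrix.cmdline.admin");
        String port = props.getProperty("heritrix.cmdline.port");
        if (admin == null || port == null) {
            throw new IllegalArgumentException("admin and port must both be set");
        }
        int colon = admin.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("expected user:password, got: " + admin);
        }
        return new HeritrixSettings(
                admin.substring(0, colon).trim(),
                admin.substring(colon + 1).trim(),
                Integer.parseInt(port.trim()));
    }

    public static void main(String[] args) {
        String text = "heritrix.cmdline.admin = admin:admin\n"
                    + "heritrix.cmdline.port = 8080\n";
        HeritrixSettings s = fromText(text);
        System.out.println(s.user + " / " + s.password + " on port " + s.port);
    }
}
```

Note that java.util.Properties treats lines starting with # or ! as comments, so comments in this file should use one of those characters.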

  


9. Add all the JAR files in the lib folder to the project's build path.
10. Right-click org.archive.crawler.Heritrix.java in the project, open its run configuration, and select the Classpath tab.
Select User Entries, then click Advanced.
Select Add Folders and add the conf folder.
Click Run to start Heritrix. The console shows output like this:

    05:22:32.875 EVENT  Starting Jetty/4.2.23
    05:22:32.937 WARN!! Delete existing temp dir C:\DOCUME~1\ADMINI~1\LOCALS~1\Temp\Jetty_127_0_0_1_8080__ for WebApplicationContext[/,jar:file:/D:/workspace/jcjcd/heritrixDemo/webapps/admin.war!/]
    05:22:33.062 EVENT  Started WebApplicationContext[/,Heritrix Console]
    05:22:33.156 EVENT  Started SocketListener on 127.0.0.1:8080
    05:22:33.156 EVENT  Started org.mortbay.jetty.Server@1f6f0bf
    Heritrix version: @VERSION@
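
If the log is hard to read, a quick way to confirm that something is listening on the configured address is to open a TCP connection to it. The sketch below assumes the host and port from the log above (127.0.0.1:8080). ConsoleProbe is a hypothetical name, and the check only shows that the port accepts connections, not that the Heritrix console itself is healthy.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical helper: checks whether a TCP port accepts connections. */
class ConsoleProbe {
    /** Returns true if a connection to host:port succeeds within the timeout. */
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timed out, or host unreachable.
            return false;
        }
    }

    public static void main(String[] args) {
        // Host and port are taken from the log output above; adjust if you changed the port.
        boolean up = isListening("127.0.0.1", 8080, 2000);
        System.out.println(up ? "Port 8080 is accepting connections."
                              : "Nothing is listening on port 8080.");
    }
}
```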

  


This completes the Heritrix configuration in Eclipse.

Now we can create a job to test it.


1. Open http://127.0.0.1:8080 in your browser and log in with the user name and password set in the configuration file.
2. Next, create a job: select Jobs in the navigation menu, then under Create New Job select With defaults.

3. Fill in the name, the description, and the URL to crawl.
4. Select Modules. Here we save the crawl results as a mirror of the site; the default writer produces compressed output. Under Writers, remove org.archive.crawler.writer.ARCWriterProcessor and add org.archive.crawler.writer.MirrorWriterProcessor.
5. Select Settings. Many options can be set on this page, such as the maximum number of threads and timeouts.
Two values must be set, both under http-headers:
user-agent: Mozilla/5.0 (compatible; heritrix/@VERSION@ +PROJECT_URL_HERE)
from: CONTACT_EMAIL_ADDRESS_HERE

Here I simply replaced @VERSION@ with the Heritrix version,
changed PROJECT_URL_HERE to http:// followed by the local IP address,
and replaced CONTACT_EMAIL_ADDRESS_HERE with an arbitrary email address. Once the configuration above is complete, select Submit job.
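
The substitution in the user-agent value can be sketched as plain string replacement. HttpHeaderTemplate is a hypothetical helper written for this example, and the version and URL passed in main are sample values, not values from the original setup. The settings page itself only needs the final string typed in.

```java
/** Hypothetical helper: fills the placeholders in the default user-agent template. */
class HttpHeaderTemplate {
    // The default template shown on the Settings page.
    static final String USER_AGENT_TEMPLATE =
            "Mozilla/5.0 (compatible; heritrix/@VERSION@ +PROJECT_URL_HERE)";

    /** Replaces @VERSION@ and PROJECT_URL_HERE with the given values. */
    static String fillUserAgent(String version, String projectUrl) {
        return USER_AGENT_TEMPLATE
                .replace("@VERSION@", version)
                .replace("PROJECT_URL_HERE", projectUrl);
    }

    public static void main(String[] args) {
        // Sample values for illustration only.
        System.out.println(fillUserAgent("1.14.4", "http://127.0.0.1"));
    }
}
```

With these sample values the result is Mozilla/5.0 (compatible; heritrix/1.14.4 +http://127.0.0.1). The + in front of the URL is part of the template and stays in place.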


6. Go to the Console and click Start to begin the crawl job.
When the crawl completes, the results can be found in the jobs folder under the project.
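
To get a rough idea of how much was downloaded, the files under the output folder can be counted. This is a small sketch: MirrorFileCounter is a hypothetical name, and the folder to inspect is an assumption that depends on your job name, so it is passed in as an argument rather than hard-coded.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

/** Hypothetical helper: counts regular files below a directory. */
class MirrorFileCounter {
    /** Walks the directory tree and counts regular files (directories are skipped). */
    static long countFiles(Path dir) {
        try (Stream<Path> paths = Files.walk(dir)) {
            return paths.filter(Files::isRegularFile).count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Pass the folder to inspect; defaults to "jobs" relative to the working directory.
        Path dir = Paths.get(args.length > 0 ? args[0] : "jobs");
        System.out.println(countFiles(dir) + " files under " + dir.toAbsolutePath());
    }
}
```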


Source: http://www.codeweblog.com/to-build-heritrix-in-eclipse/

http://www.codeweblog.com/search/Heritrix/
