-
Notifications
You must be signed in to change notification settings - Fork 14
How To_ Nutch
This page describes how you can use nutch to crawl a website and generate a lucene index that can be used by the gentics content connector.
- Index (Crawl) Server - (hint: use linux ;))
- The latest nutch release.
Connect to your linux index server and navigate to a directory where you want to install nutch. Then download the latest release.
//Change the url for your needs
wget http://mirror.deri.at/apache/lucene/nutch/nutch-1.0.tar.gz
Then you have to extract the archive
tar xzf nutch-1.0.tar.gz
Editing the nutch configuration is very simple. You only have to follow these small steps:
First you have to create a custom directory that holds the urls you want to index. Navigate to your nutch base directory and do the following:
mkdir urls
cd urls
vim nutch
Now you should have opened the file nutch in your new urls directory. There you have to specify the entry page where the crawler should start.
e.g. we want to index http://www.gentics.com => contents of nutch http://www.gentics.com
Open the file /conf/crawl-urlfilter.txt and edit the filter for the urls you want to allow the crawler to follow.
We only change the filter to allow all subdomains of gentics.com and leave everything else as it is. # accept hosts in MY.DOMAIN.NAME +^http://([a-z0-9]*.)gentics.com/
Now we have to defind the crawlers identity that will be sent to the server as user agent.
Open the file /conf/nutch-site.xml and insert something like the following:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>nutch</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
</description>
</property>
<property>
<name>http.agent.description</name>
<value>nutch index crawler</value>
<description>Further description of our bot- this text is used in
the User-Agent header. It appears in parenthesis after the agent name.
</description>
</property>
<property>
<name>http.agent.url</name>
<value>your.website</value>
<description>A URL to advertise in the User-Agent header. This will
appear in parenthesis after the agent name. Custom dictates that this
should be a URL of a page explaining the purpose and behavior of this
crawler.
</description>
</property>
<property>
<name>http.agent.email</name>
<value>your.email.com</value>
<description>An email address to advertise in the HTTP 'From' request
header and User-Agent header. A good practice is to mangle this
address (e.g. 'info at example dot com') to avoid spamming.
</description>
</property>
</configuration>
In order to have a highlighted text snipped available you have to set the content attribute to be stored in the index. Unfortunately nutch sets this value hardcoded to false. You have to replace the existing index-basic plugin jar with the one provided here for download.
You find the existing plugin in the folder plugins/index-basic.
In order to get nutch to use this settings you have to edit conf/nutch-site.xml and add a block like this:
<property>
<name>plugin.includes</name>
<value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins.
</description>
</property>
If you want to search for specific languages you have to enable the language-identifier plugin to add the lang field to the index.
<property>
<name>plugin.includes</name>
<value>nutch-extensionpoints|protocol-http|language-identifier|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins.
</description>
</property>
You also have to add the default configuration for the language-identifier to your nutch-default.xml
<!-- language-identifier plugin properties -->
<property>
<name>lang.ngram.min.length</name>
<value>1</value>
<description> The minimum size of ngrams to uses to identify
language (must be between 1 and lang.ngram.max.length).
The larger is the range between lang.ngram.min.length and
lang.ngram.max.length, the better is the identification, but
the slowest it is.
</description>
</property>
<property>
<name>lang.ngram.max.length</name>
<value>4</value>
<description> The maximum size of ngrams to uses to identify
language (must be between lang.ngram.min.length and 4).
The larger is the range between lang.ngram.min.length and
lang.ngram.max.length, the better is the identification, but
the slowest it is.
</description>
</property>
<property>
<name>lang.analyze.max.length</name>
<value>2048</value>
<description> The maximum bytes of data to uses to indentify
the language (0 means full content analysis).
The larger is this value, the better is the analysis, but the
slowest it is.
</description>
</property>
<property>
<name>query.lang.boost</name>
<value>0.0</value>
<description> Used as a boost for lang field in Lucene query.
</description>
</property>
Once you configured everything you can test your configuration with a shallow depth and a low number of links to follow:
execute the command from your nutch base directory bin/nutch crawl urls -dir your_crawl_dir -depth 3 -topN 50
If you are happy with the results returned, you can increase the values. A common depth value for a full crawl would be around 10 and topN can range from hundreds to hundreds of thousands depending on the sources you want to index/crawl.
You can find more infos about how to crawl on this tutorial.
Once the crawling finished, you can find the index in nutch_base_dir/your_crawl_dir/index
If you like the outcome you can create a start script that is being executed as cronjob or called by the scheduler of Gentics Content.Node. Your start script could look something like that:
rm -r crawl
bin/nutch crawl urls -dir crawl -depth 5 -topN 50
rm -r index
cp -r crawl/index index
touch index/reopen
- You can find more information about nutch on the project homepage.
- Running Nutch on Windows
- Running Nutch on Windows without cygwin