I spent two days gathering information on which FOSS search engine and web crawler I should use for my company's internal search, and settled on the combination of:
* Solr 1.4 (advanced search engine)
* Nutch 1.1 (search engine with a highly configurable crawler)
This HOWTO has been updated to match Ubuntu 10.04 and is based on the work of Sami Siren at Lucid Imagination, available here:
http://www.lucidimagination.com/blog...09/nutch-solr/
Overview
This HOWTO consists of the following:
- Installing Solr
- Installing Nutch
- Configuring Solr
- Configuring Nutch
- Crawling your site
- Indexing our crawl DB with Solr
- Searching the crawled content in Solr
Alright, 'nough talk. Let's get down to it!
Prerequisites
I'll assume that you have an Ubuntu 10.04 server installed and that you are logged in as root whilst working on this.
Installing Solr
Luckily, Solr 1.4 is available in APT!
Code:
# apt-get install solr-common solr-tomcat tomcat6
Installing Nutch
Go to a suitable working directory, then download and unpack Nutch:
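For example (the mirror URL, the archive name and the name of the directory it extracts to are assumptions; adjust the commands to match what you actually download from the Apache Nutch download page). The rest of this HOWTO assumes Nutch ends up in /usr/share/nutch:
Code:
# cd /usr/local/src
# wget http://archive.apache.org/dist/nutch/apache-nutch-1.1-bin.tar.gz
# tar xzf apache-nutch-1.1-bin.tar.gz
# mv apache-nutch-1.1 /usr/share/nutch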
Configuring Solr
For the sake of simplicity we are going to use the example configuration of Solr as a base.
Back up the original file:
Code:
# mv /etc/solr/conf/schema.xml /etc/solr/conf/schema.xml.orig
And replace the Solr schema with the one provided by Nutch:
Code:
# cp /usr/share/nutch/conf/schema.xml /etc/solr/conf/schema.xml
Now, we need to configure Solr to create snippets for search results.
Edit /etc/solr/conf/schema.xml and change the following line:
Code:
<field name="content" type="text" stored="false" indexed="true"/>
To this:
Code:
<field name="content" type="text" stored="true" indexed="true"/>
Create a new dismax request handler to enable relevancy tweaks.
Back up the original file:
Code:
# cp /etc/solr/conf/solrconfig.xml /etc/solr/conf/solrconfig.xml.orig
Add the following fragment to /etc/solr/conf/solrconfig.xml (inside the <config> element, alongside the existing request handlers):
Code:
<!-- GZR: 2010-07-15 Added to integrate with the Nutch crawler. -->
<requestHandler name="/nutch" class="solr.SearchHandler" >
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <str name="tie">0.01</str>
    <str name="qf">
      content^0.5 anchor^1.0 title^1.2
    </str>
    <str name="pf">
      content^0.5 anchor^1.5 title^1.2 site^1.5
    </str>
    <str name="fl">
      url
    </str>
    <!-- Note: the < characters must be escaped as &lt; or the XML will not parse -->
    <str name="mm">
      2&lt;-1 5&lt;-2 6&lt;90%
    </str>
    <str name="ps">100</str>
    <str name="hl">true</str>
    <str name="q.alt">*:*</str>
    <str name="hl.fl">title url content</str>
    <str name="f.title.hl.fragsize">0</str>
    <str name="f.title.hl.alternateField">title</str>
    <str name="f.url.hl.fragsize">0</str>
    <str name="f.url.hl.alternateField">url</str>
    <str name="f.content.hl.fragmenter">regex</str>
  </lst>
</requestHandler>
Now, restart Tomcat:
Code:
# service tomcat6 restart
It is a good idea to tail -f the Tomcat log files (under /var/log/tomcat6/) and ensure that there are no errors, for example:
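Code:
# tail -f /var/log/tomcat6/catalina.out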
Configuring Nutch
Go into the Nutch directory and do all the work from there:
Code:
# cd /usr/share/nutch
Edit conf/nutch-site.xml and add the following properties between the <configuration> tags:
Code:
<property>
  <name>http.robots.agents</name>
  <value>nutch-solr-integration-test,*</value>
  <description></description>
</property>
<property>
  <name>http.agent.name</name>
  <value>nutch-solr-integration-test</value>
  <description>FreeCode AS Robots Name</description>
</property>
<property>
  <name>http.agent.description</name>
  <value>FreeCode Norway Web Crawler using Nutch 1.1</value>
  <description></description>
</property>
<property>
  <name>http.agent.url</name>
  <value>http://www.freecode.no/</value>
  <description></description>
</property>
<property>
  <name>http.agent.email</name>
  <value>YOUR EMAIL ADDRESS HERE</value>
  <description></description>
</property>
<property>
  <name>http.agent.version</name>
  <value></value>
  <description></description>
</property>
<property>
  <name>generate.max.per.host</name>
  <value>100</value>
</property>
I need to ensure that the crawler does not leave our domain, otherwise it would end up crawling the entire Internet, so I inserted our domain into conf/regex-urlfilter.txt (make sure these rules come before, or replace, the default catch-all +. rule at the end of the file):
Code:
# allow urls in freecode.no domain
+^http://([a-z0-9\-A-Z]*\.)*freecode.no/
# deny anything else
-.
Now, we need to instruct the crawler where to start crawling, so create a seed list:
Code:
# mkdir urls
# echo "http://www.freecode.no/" > urls/seed.txt
Crawling your site
Let's start crawling!
Start by injecting the seed URL(s) into the Nutch crawldb:
Code:
# bin/nutch inject crawl/crawldb urls
Next, generate a fetch list:
Code:
# bin/nutch generate crawl/crawldb crawl/segments
The above command generates a new segment directory under /usr/share/nutch/crawl/segments containing the URLs to be fetched. The following commands all take the latest segment directory as their main parameter, so we'll store it in an environment variable:
Code:
# export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
Launch the crawler!
Code:
# bin/nutch fetch $SEGMENT -noParsing
And parse the fetched content:
Code:
# bin/nutch parse $SEGMENT
Now we need to update the crawl database so that future crawls know which pages have already been fetched and only fetch new and changed pages:
Code:
# bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
Create a link database:
Code:
# bin/nutch invertlinks crawl/linkdb -dir crawl/segments
That completes a full fetch cycle. I have created a script to simplify the crawl job:
Code:
#!/bin/bash
# Short script to run one full Nutch crawl cycle:
# generate -> fetch -> parse -> updatedb -> invertlinks
unset SEGMENT
# Set this to your Nutch home directory
NUTCH_HOME="/usr/share/nutch"
cd "${NUTCH_HOME}" || { echo "Could not cd to ${NUTCH_HOME}!"; exit 1; }
"${NUTCH_HOME}/bin/nutch" generate crawl/crawldb crawl/segments \
    || { echo "An error occurred while generating the fetch list!"; exit 1; }
# Pick up the newest segment directory created by the generate step
export SEGMENT=crawl/segments/`ls -tr crawl/segments | tail -1`
"${NUTCH_HOME}/bin/nutch" fetch "$SEGMENT" -noParsing \
    || { echo "An error occurred while FETCHING the content!"; exit 1; }
"${NUTCH_HOME}/bin/nutch" parse "$SEGMENT" \
    || { echo "An error occurred while PARSING the content!"; exit 1; }
"${NUTCH_HOME}/bin/nutch" updatedb crawl/crawldb "$SEGMENT" -filter -normalize \
    || { echo "An error occurred while updating the crawl DB!"; exit 1; }
"${NUTCH_HOME}/bin/nutch" invertlinks crawl/linkdb -dir crawl/segments \
    || { echo "An error occurred while creating the link database!"; exit 1; }
# success!
exit 0
#EOF
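Save the script somewhere convenient, for example as /usr/local/bin/nutch-crawl (the name and location are just suggestions), make it executable and run it to perform a full crawl cycle:
Code:
# chmod +x /usr/local/bin/nutch-crawl
# /usr/local/bin/nutch-crawl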
Indexing our crawl DB with Solr
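To push the crawled and parsed content into Solr, use Nutch's solrindex command. The Solr URL below assumes the solr-tomcat package's default of Tomcat listening on port 8080 with Solr deployed under /solr; adjust it if your Solr instance is reachable elsewhere:
Code:
# bin/nutch solrindex http://localhost:8080/solr/ crawl/crawldb crawl/linkdb crawl/segments/*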
Searching the crawled content in Solr
Now the indexed content is available through Solr. You can execute searches from the Solr admin UI, or query the /nutch request handler we defined earlier directly with a URL, as shown below.
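The Solr admin UI should now be reachable at http://localhost:8080/solr/admin/ (again assuming the solr-tomcat defaults). You can also hit the /nutch dismax handler directly from a browser or with curl; the query term freecode below is just an example:
Code:
# curl "http://localhost:8080/solr/nutch?q=freecode&indent=on"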