Nutch

From KallestadWiki

Site Search with Nutch

What follows is a basic set of instructions for setting up site search with Nutch.

Setting up Tomcat

Getting Tomcat up and running is relatively easy. Do not use the root login for the install, as superuser privileges are not necessary.

  • Have Java installed
  • Download the install archive for Tomcat
  • Extract the Tomcat archive
  • Move the files, or link them, to a directory with an easier-to-remember name, such as ./tomcat rather than ./apache-tomcat-x.x.xx
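The steps above can be sketched as a small shell helper. This is only a sketch: the function name install_tomcat and its arguments are hypothetical, and it assumes the download is a gzipped tarball with a single top-level directory.

```shell
# Hypothetical helper: unpack a Tomcat tarball and link it to ./tomcat.
#   $1 = path to the downloaded archive (e.g. apache-tomcat-x.x.xx.tar.gz)
#   $2 = directory to install into (defaults to the current directory)
install_tomcat() {
    archive=$1
    dest=${2:-.}

    # Tomcat needs Java; warn early if it is not on the PATH.
    command -v java >/dev/null 2>&1 || echo "warning: java not found on PATH" >&2

    # Take the top-level directory name from the first entry in the archive.
    top=$(tar -tzf "$archive" | sed -n '1{s|/.*||;p;}')
    [ -n "$top" ] || { echo "could not read $archive" >&2; return 1; }

    tar -xzf "$archive" -C "$dest" || return 1

    # Link the versioned directory to the easier-to-remember name "tomcat".
    ln -sfn "$top" "$dest/tomcat"
}
```

Calling install_tomcat with the archive path and your home directory would leave a tomcat link there, pointing at the versioned directory.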

Edit the server.xml file:

 <Server port="8005" shutdown="SHUTDOWN">
   <Service name="Catalina">
     <Connector port="8080" address="123.456.789.255"/>
     <Connector port="8009" protocol="AJP/1.3" enableLookups="false" address="127.0.0.1"
                 maxThreads="100" minSpareThreads="10" maxSpareThreads="35" />
     <Engine name="Catalina" defaultHost="localhost">
       <Realm className="org.apache.catalina.realm.UserDatabaseRealm"
              resourceName="UserDatabase" />
       <Host name="localhost" appBase="webapps" unpackWARs="true" autoDeploy="true" xmlValidation="false" xmlNamespaceAware="false"/>
     </Engine>
   </Service>
 </Server>

This is a pretty minimal setup. Replace the placeholder address 123.456.789.255 with your server's own IP address. Your mileage may vary.

Delete everything in your webapps directory

 cd tomcat/webapps && rm -rf ./*
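Since a recursive delete in the wrong directory is painful, a more defensive version may be worth having. The following is a sketch only; clean_webapps is a hypothetical name, and it assumes a find that supports -mindepth and -maxdepth (GNU and BSD find both do).

```shell
# Hypothetical helper: empty a Tomcat webapps directory, and nothing else.
#   $1 = path to the webapps directory
clean_webapps() {
    dir=${1%/}   # drop a trailing slash, if any

    # Refuse any path whose last component is not "webapps".
    case $dir in
        webapps|*/webapps) ;;
        *) echo "refusing to clean '$dir': not a webapps directory" >&2
           return 1 ;;
    esac

    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }

    # Remove every entry (dotfiles included) but keep the directory itself.
    find "$dir" -mindepth 1 -maxdepth 1 -exec rm -rf {} +
}
```

For example, clean_webapps tomcat/webapps empties the directory, while clean_webapps tomcat is refused.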

Getting Running with Nutch

Set up some config files

Create a directory and a file listing your root URLs (replace the example URL with your own site's):

 mkdir urls
 echo "http://lucene.apache.org/nutch/" > urls/nutch

Open conf/crawl-urlfilter.txt:

Replace my.domain.com with your own domain. For example, for apache.org the accept rule becomes:

 +^http://([a-z0-9]*\.)*apache.org/

Add js to the list of file extensions to skip:

 -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|js)$
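To get a feel for what these two rules let through before running a crawl, they can be approximated on the command line. This is a rough sketch: url_allowed is a hypothetical helper, it uses grep -E rather than the Java regular expressions Nutch uses, and it only models the skip rule and the accept rule shown above, not the rest of the file.

```shell
# Hypothetical helper: succeed if a URL would pass the two rules above.
# Approximation only; Nutch evaluates these patterns as Java regexes.
url_allowed() {
    url=$1

    # Skip rule ("-" line): reject URLs ending in a listed extension.
    if printf '%s\n' "$url" | grep -Eq '\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|js)$'; then
        return 1
    fi

    # Accept rule ("+" line): the URL must be on apache.org or a subdomain.
    printf '%s\n' "$url" | grep -Eq '^http://([a-z0-9]*\.)*apache.org/'
}

# Example: report the verdict for a couple of URLs.
for u in "http://lucene.apache.org/nutch/" "http://lucene.apache.org/nutch/menu.js"; do
    if url_allowed "$u"; then echo "crawl $u"; else echo "skip  $u"; fi
done
```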

Run a sample crawl for troubleshooting:

 bin/nutch crawl urls -dir crawl -depth 3 -topN 25 >crawl-results.txt

Examine crawl-results.txt to ensure that the crawler is fetching the pages you want and skipping the ones you don't. If you see something you don't like, edit your conf/crawl-urlfilter.txt once again.
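A quick way to review the results is to pull out just the fetched URLs. The sketch below assumes that each fetched page appears in crawl-results.txt as a line beginning with "fetching" followed by the URL; check your own output first, since the log format may differ between versions. list_fetched is a hypothetical helper name.

```shell
# Hypothetical helper: print the unique URLs the crawler fetched.
# Assumption: fetched pages are logged as "fetching <url>" lines.
#   $1 = path to the crawl output file (e.g. crawl-results.txt)
list_fetched() {
    grep '^fetching ' "$1" | awk '{print $2}' | sort -u
}
```

Running list_fetched crawl-results.txt then gives a short list to scan for pages that should have been filtered out.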

Note: After reviewing my own crawl-results.txt, I had to add the $ symbol to my list of URL characters to skip.

Verify searching capability

You can run a search on the command line pretty easily:

 bin/nutch org.apache.nutch.searcher.NutchBean "Search Term"

View the web client

Copy the nutch.war file to Tomcat's webapps directory as ROOT.war:

 cp nutch*.war /wherever/tomcat/is/tomcat/webapps/ROOT.war

Start up Tomcat. This has to be done from the parent of the crawl directory, so that the web client can find the crawl data:

 cd /wherever/nutch/is/
 /wherever/tomcat/is/bin/catalina.sh start
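Because starting Tomcat from the wrong directory is an easy mistake, the two commands above can be wrapped in a small check. This is a sketch with a hypothetical function name; it assumes the crawl was written to a directory called crawl, as in the crawl command used earlier.

```shell
# Hypothetical helper: start Tomcat from the parent of the crawl directory.
#   $1 = Nutch directory (the one that contains crawl/)
#   $2 = absolute path to catalina.sh
start_tomcat_for_nutch() {
    nutch_dir=$1
    catalina=$2

    # Without crawl data there is nothing for the web client to search.
    if [ ! -d "$nutch_dir/crawl" ]; then
        echo "no crawl directory under $nutch_dir; run the crawl first" >&2
        return 1
    fi

    # Run in a subshell so the caller's working directory is unchanged.
    ( cd "$nutch_dir" && "$catalina" start )
}
```

With the placeholder paths used above, the call would be start_tomcat_for_nutch /wherever/nutch/is /wherever/tomcat/is/bin/catalina.sh.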

You should be able to view the web client on port 8080, or whatever port you have defined in your server.xml file. If you are using AJP, you'll need to set up Apache as an AJP proxy first.
