Nutch
Site Search with Nutch
What follows is a basic set of instructions for setting up site search with Nutch.
Setting up Tomcat
Getting Tomcat up and running is relatively easy. Do not use the root login for the install, since superuser privileges are not necessary.
- Have Java installed
- Download the install file for Tomcat
- Extract the Tomcat archive
- Move the files, or link them, to an easier-to-remember directory name, such as ./tomcat rather than ./apache-tomcat-x.x.xx
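The steps above can be sketched as a small shell function. This is only a minimal sketch: it assumes the archive has already been downloaded as a .tar.gz that unpacks into a single top-level directory, and the function name install_tomcat and its two arguments are hypothetical, not part of Tomcat.

```shell
#!/bin/sh
# Hypothetical helper: extract a Tomcat archive and link it to <target>/tomcat.
# Usage: install_tomcat /path/to/apache-tomcat-x.x.xx.tar.gz /install/dir
install_tomcat() {
    archive=$1
    target=$2

    # Tomcat needs a Java runtime; warn early if none is on the PATH.
    command -v java >/dev/null 2>&1 || echo "warning: java not found on PATH" >&2

    mkdir -p "$target" || return 1
    tar -xzf "$archive" -C "$target" || return 1

    # Assumption: the first archive entry names the single top-level directory.
    extracted=$(tar -tzf "$archive" | head -n 1 | cut -d/ -f1)

    # Link the versioned directory to an easier-to-remember name.
    ln -sfn "$extracted" "$target/tomcat"
}
```

Using a symbolic link rather than renaming the directory keeps the version number visible on disk while the rest of these instructions can refer to ./tomcat.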
Edit the server.xml file:
<Server port="8005" shutdown="SHUTDOWN">
  <Service name="Catalina">
    <Connector port="8080" address="123.456.789.255"/>
    <Connector port="8009" protocol="AJP/1.3" enableLookups="false" address="127.0.0.1"
               maxThreads="100" minSpareThreads="10" maxSpareThreads="35" />
    <Engine name="Catalina" defaultHost="localhost">
      <Realm className="org.apache.catalina.realm.UserDatabaseRealm"
             resourceName="UserDatabase" />
      <Host name="localhost" appBase="webapps" unpackWARs="true" autoDeploy="true" xmlValidation="false" xmlNamespaceAware="false"/>
    </Engine>
  </Service>
</Server>
This is a pretty minimal setup. Your mileage may vary.
Delete everything in your webapps directory
cd tomcat/webapps
rm -rf *
Getting Running with Nutch
Set up some config files
Create a directory and a file for root urls
mkdir urls
cd urls
vi nutch
<<insert your site url ex:>>
http://lucene.apache.org/nutch/
<<:wq>>
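If you would rather not open an editor, the same seed file can be written non-interactively. A minimal sketch, run from the Nutch directory; the function name write_seed_urls is hypothetical, and the file name urls/nutch simply follows the example above.

```shell
#!/bin/sh
# Hypothetical helper: write one or more seed URLs to urls/nutch.
# Usage: write_seed_urls http://lucene.apache.org/nutch/
write_seed_urls() {
    mkdir -p urls || return 1
    # One URL per line; printf repeats the format for each argument.
    printf '%s\n' "$@" > urls/nutch
}
```

Calling it again overwrites the file, so pass every root URL in a single call.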
Open conf/crawl-urlfilter.txt:
Replace my.domain.com with your own domain. For the example URL above, the line becomes:
+^http://([a-z0-9]*\.)*apache.org/
Add js files to the list of files to skip
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|js)$
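Before running a crawl, it can help to check a few URLs against these two rules by hand. The sketch below approximates them with grep -E; Nutch itself evaluates Java regular expressions and the file contains other rules as well, so treat this as a rough sanity check only. It assumes the skip rule is evaluated before the accept rule, and the function name url_would_be_crawled is hypothetical.

```shell
#!/bin/sh
# The two rules shown above, without their leading "-" and "+" markers.
SKIP_SUFFIX='\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|js)$'
ACCEPT_HOST='^http://([a-z0-9]*\.)*apache.org/'

# Hypothetical helper: succeeds if the URL passes both rules.
url_would_be_crawled() {
    url=$1
    # A matching "-" rule rejects the URL.
    if printf '%s\n' "$url" | grep -Eq "$SKIP_SUFFIX"; then
        return 1
    fi
    # Otherwise the URL must match the "+" rule.
    printf '%s\n' "$url" | grep -Eq "$ACCEPT_HOST"
}
```

For example, `url_would_be_crawled http://lucene.apache.org/nutch/ && echo crawl` reports the seed URL as accepted, while a URL ending in .js is rejected.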
Run a sample crawl for troubleshooting:
bin/nutch crawl urls -dir crawl -depth 3 -topN 25 >crawl-results.txt
Examine crawl-results.txt to ensure that the crawler is fetching the files you want and skipping the ones you don't. If you see something you don't like, edit conf/crawl-urlfilter.txt again and rerun the crawl.
Note: After reviewing my own crawl-results.txt, I had to add the $ symbol to my list of URL characters to skip.
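Reviewing the output is easier with the fetched URLs pulled out into a sorted list. The sketch below assumes each fetch is logged as a line containing "fetching" followed by the URL; that log format is an assumption and may differ between Nutch versions, so adjust the pattern to whatever your crawl-results.txt actually contains. The function name fetched_urls is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list the unique URLs that appear after "fetching" in a crawl log.
# Usage: fetched_urls [logfile]   (defaults to crawl-results.txt)
fetched_urls() {
    log=${1:-crawl-results.txt}
    # Keep only the URL from matching lines, then sort and de-duplicate.
    sed -n 's/^.*fetching \(http[^ ]*\).*$/\1/p' "$log" | sort -u
}
```

Piping the result through grep, for example `fetched_urls | grep '\$'`, is a quick way to spot URLs containing a character you meant to exclude.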
Verify searching capability
You can run a search on the command line pretty easily:
bin/nutch org.apache.nutch.searcher.NutchBean "Search Term"
View the web client
Copy the nutch.war file into Tomcat's webapps directory as ROOT.war:
cp nutch*.war /wherever/tomcat/is/tomcat/webapps/ROOT.war
Start up Tomcat. This has to be done from the parent of the crawl directory:
cd /wherever/nutch/is/
/wherever/tomcat/is/tomcat/bin/catalina.sh start
You should be able to view the web client on port 8080, or whatever port you have defined in your server.xml file. If you are using AJP, you'll need to set up Apache as an AJP proxy first.
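The deploy and start steps can be wrapped in two small functions so the working-directory requirement is not forgotten. This is a sketch under stated assumptions: the function names deploy_nutch_war and start_search are hypothetical, the first argument is the Nutch directory (the parent of crawl), and the second is the Tomcat directory that contains webapps and bin.

```shell
#!/bin/sh
# Hypothetical helper: copy the Nutch war into Tomcat as ROOT.war.
# Usage: deploy_nutch_war /wherever/nutch/is /wherever/tomcat/is/tomcat
deploy_nutch_war() {
    nutch_home=$1
    tomcat_home=$2
    # Pick the first nutch*.war found in the Nutch directory.
    war=$(ls "$nutch_home"/nutch*.war 2>/dev/null | head -n 1)
    [ -n "$war" ] || { echo "no nutch*.war in $nutch_home" >&2; return 1; }
    cp "$war" "$tomcat_home/webapps/ROOT.war"
}

# Hypothetical helper: start Tomcat from the parent of the crawl directory.
# Usage: start_search /wherever/nutch/is /wherever/tomcat/is/tomcat
start_search() {
    nutch_home=$1
    tomcat_home=$2
    [ -d "$nutch_home/crawl" ] || { echo "no crawl directory in $nutch_home" >&2; return 1; }
    # The subshell keeps the caller's working directory unchanged.
    ( cd "$nutch_home" && "$tomcat_home/bin/catalina.sh" start )
}
```

Checking for the crawl directory before starting catches the most common mistake, which is launching Tomcat from a directory where the web client cannot find the index.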
