Playing with Nutch


After a brief review of Solr, I decided to give Nutch a go. Unlike Solr, which is distributed with Jetty so you can get going right away, Nutch requires that Tomcat be set up first. This isn't much of a problem, but strangely, Nutch has to be set up as the root context. I'm sure there's a way around that, but none of the setup guides I've found address the issue beyond mentioning that it exists.

Nutch does have some advantages over Solr for a site search engine: it comes with its own crawling utility that is fairly simple to set up and use. It also has a web interface for searching that you can use pretty easily.

On the one hand, it's a very powerful and useful piece of software that you can use as prescribed with minimal installation headache. On the other hand, configuration requires setting up a bunch of XML files and reading through a lot of documentation, mailing lists, and the wiki if you are to fully utilize its capabilities.

Strangely, while Solr provides a decent administrative interface, I haven't seen one for Nutch. Maintenance seems to be command-line based, though perhaps I just missed the interface. The web interface for searching does leave a little to be desired. In my experience it shows a small set of results (fewer than 5) even when the number of returned documents is much larger. I don't quite see the logic in that, but there is a button to present the full result set. Pressing that button presents 10 results and gives you another button to see the next 10. There are no links that take you directly to a specific page of results.
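If you end up writing your own front end, direct page links are easy to generate. Here is a minimal sketch in Python. The function name page_links, the /search.jsp path, and the query, start, and hitsPerPage parameter names are all assumptions for illustration; check them against whatever your search page actually accepts.

```python
from urllib.parse import urlencode


def page_links(query, total_hits, hits_per_page=10, base_url="/search.jsp"):
    """Return one (page_number, url) pair for every page of results.

    base_url and the parameter names are hypothetical; adjust them to
    match the search page you are linking to.
    """
    links = []
    # Each page starts at a multiple of hits_per_page: 0, 10, 20, ...
    for start in range(0, total_hits, hits_per_page):
        params = urlencode(
            {"query": query, "start": start, "hitsPerPage": hits_per_page}
        )
        page_number = start // hits_per_page + 1
        links.append((page_number, f"{base_url}?{params}"))
    return links


if __name__ == "__main__":
    for number, url in page_links("nutch", total_hits=35):
        print(number, url)
```

With 35 hits and 10 per page this yields four links, the last one starting at offset 30.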

Performance-wise, the first few searches seemed on the slow side. A few minutes later performance really picked up and searches were nearly instant. I wasn't running much of anything on my machine at the time, so I don't know what to attribute the noticeable difference to. Restarting Tomcat reproduces the same pattern consistently.

As with Solr, I don't see any easy way to tweak the results to your liking, but I do see scoring summaries in the log files, so perhaps Nutch does have configuration settings that would be useful.

On the one hand, I like having a web search utility right out of the box. On the other hand, you would more than likely end up spending a fair amount of time styling the search utility, so writing your own and interfacing with Solr via JSON or XML isn't such a bad idea. I don't think there would be much of a difference in time spent.
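To give a feel for how little code the Solr route needs, here is a rough sketch in Python that builds a query URL asking for JSON (wt=json) and pulls the hit count and documents out of the response. The base URL, the helper names, and the id and title fields in the sample are assumptions for illustration; your schema decides which fields actually come back.

```python
import json
from urllib.parse import urlencode


def solr_query_url(base_url, query, start=0, rows=10):
    """Build a select URL that asks Solr for a JSON response."""
    params = urlencode({"q": query, "start": start, "rows": rows, "wt": "json"})
    return f"{base_url}/select?{params}"


def parse_solr_response(body):
    """Return (total_hits, docs) from a Solr JSON response body."""
    data = json.loads(body)
    response = data["response"]
    return response["numFound"], response["docs"]


if __name__ == "__main__":
    # Hypothetical base URL; point this at your own Solr instance.
    print(solr_query_url("http://localhost:8983/solr", "nutch crawler"))

    # A hand-written sample response, not real output. The fields
    # "id" and "title" are made up for this example.
    sample = """
    {"responseHeader": {"status": 0, "QTime": 1},
     "response": {"numFound": 2, "start": 0, "docs": [
        {"id": "1", "title": "Playing with Nutch"},
        {"id": "2", "title": "A brief review of Solr"}]}}
    """
    total, docs = parse_solr_response(sample)
    print(total, [doc["title"] for doc in docs])
```

Fetching the URL and rendering the documents in your own templates is all that remains, which is why I don't expect the custom route to take much longer than restyling the bundled search page.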

You could easily set up a small vertical search engine with Nutch. It wouldn't even take much effort if all you wanted was a proof of concept. However, without the ability to tweak the results manually, and without a way to penalize sites in your system, I don't think the results would be very desirable.




03-03-2007, 02:18 AM  
Devin
 
 
Re: Discussion: Playing with Nutch

I see that you are using Nutch now. Any particular reason why you chose that route versus one of the others?