Nutch as a site search tool


I've implemented Nutch locally, and I'll probably be updating its appearance regularly over the next few weeks, but you can use my site search tool at that link. The difference between this search, this one, and this one is that the first link (Nutch) is a true full-site search, whereas the latter two are specific to the blog and the wiki. The site search has a more extensive set of heuristics and is faster with lower resource requirements, whereas the search facilities built into MediaWiki and Movable Type are based on what MySQL returns. The MySQL result set is generally not that great, but on a site like mine with fewer than a couple hundred pages (right now) it is usually pretty good. You can compare and contrast and use whichever you want.

Getting Nutch set up on this site was no easy task. Sure, getting Tomcat working and running an initial crawl with the Nutch crawler wasn't so bad, but having it sit in a subdirectory rather than on its own separate IP was the real problem. There doesn't seem to be a lot of information out there about implementing Nutch, and the mailing lists aren't exactly peppered with newbie questions. In the end, I used mod_proxy and mod_proxy_html to get it working.

I toyed with the idea of using mod_jk, and I actually made an attempt at using mod_proxy_ajp, but in the end plain HTTP proxying worked, and I probably won't revisit the proxy architecture until I release my new site layout.
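
For anyone attempting the same thing, a subdirectory reverse proxy of this kind can be sketched roughly as follows. This is an illustration under assumptions, not the exact configuration used here: the /search/ path and the Tomcat address http://localhost:8080/ are placeholders, and it assumes mod_proxy, mod_proxy_http, mod_proxy_html and mod_headers are all loaded.

```apache
# Sketch only: /search/ and localhost:8080 are assumed placeholder values.
ProxyRequests Off

# Forward requests under /search/ to the Tomcat instance serving Nutch,
# and fix up Location headers on redirects coming back from it.
ProxyPass        /search/ http://localhost:8080/
ProxyPassReverse /search/ http://localhost:8080/

<Location /search/>
    # Run returned HTML through mod_proxy_html so root-relative links
    # (href="/...") are rewritten to stay under the subdirectory.
    SetOutputFilter proxy-html
    ProxyHTMLURLMap / /search/

    # Ask the backend for uncompressed pages so the filter can parse them
    # (this directive needs mod_headers).
    RequestHeader unset Accept-Encoding
</Location>
```

The ProxyHTMLURLMap line is what makes the subdirectory work without touching the Nutch templates: the application still emits links as if it lived at the root, and the filter rewrites them on the way out.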

I hate to have to use mod_proxy_html for a reverse proxy, but finding all of the files needed to make template adjustments is a little bit painful. Not only that, but I have to read through HTML with Java seeded all over the place, and use vi to edit it, because my subconscious mind has taken a stand against the idea of me downloading the files and editing them locally.


