I updated the wiki with some instructions on how to get Nutch running as a site search engine. I still have some problems with my own implementation, but I've been able to proxy it without having to rewrite any HTML on the way through, and that's pretty important.
Here's something interesting about my new site design: I had originally thought that my internal linking was pretty good, but Nutch has informed me that it's not. Here's the reason. I set the crawl depth to 3, and as it turns out, a lot of my blog articles sit four levels deep, counting the root URL as the first level: Root Page - Blog Page - Category Page - Blog Article Page. I could certainly work around that with a sitemap, but the point is to have a strong site structure to begin with.
I have a couple of thoughts on how to deal with this issue, but the easiest to implement is to add the blog category list to the menu items at the top of the page. That gives the categories more importance than I wanted to give them, but it puts each article two links away from the home page, once I get the home page on the same template as the blog, which is what I want. I also want to get the wiki on this template eventually, but that's a whole different ball game and not very important to me right now.
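To make the link counting above concrete, here is a minimal sketch in Python that measures how many clicks each page is from the root using a breadth-first search. The URLs and the two link maps are made up for illustration and are not my real site structure. The idea that a depth-3 crawl only fetches pages up to two clicks from the root is my assumption about how the depth limit counts the root page.

```python
from collections import deque


def click_depths(links, root):
    """Return the minimum number of clicks from root to each reachable page."""
    depths = {root: 0}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # first visit is the shortest path in BFS
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


# Hypothetical current structure: root -> blog -> category -> article.
current_site = {
    "/": ["/blog/"],
    "/blog/": ["/blog/category/"],
    "/blog/category/": ["/blog/category/article.html"],
}

# Hypothetical structure with the category list in the top menu,
# so the root page links to the category page directly.
menu_site = {
    "/": ["/blog/", "/blog/category/"],
    "/blog/": ["/blog/category/"],
    "/blog/category/": ["/blog/category/article.html"],
}

# Assumption: a crawl depth of 3 fetches the root plus pages up to 2 clicks away.
MAX_CLICKS = 2

current_depths = click_depths(current_site, "/")
menu_depths = click_depths(menu_site, "/")

for name, depths in (("current", current_depths), ("with menu", menu_depths)):
    missed = sorted(page for page, d in depths.items() if d > MAX_CLICKS)
    print(name, "pages beyond the crawl limit:", missed)
```

In the first map the article is three clicks from the root and falls outside the limit. In the second it is two clicks away and gets picked up, which is the effect I'm after with the menu change.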
Actually, I do eventually want to move the blog to the root, so articles are published directly off of the site root, and I can replace the blog area with a separate blog... Do you follow that? What I mean to say is that I want to use MT as an article publishing interface and run that publication engine at the site root. I additionally want to use MT as a blog interface and run that off of the currently allocated /blog location. The difference between the two is that the site root would hold more in-depth information, whereas the blog area would hold more bloggy information.
I will post some instructions soon for proxying Nutch, but first I want to migrate my current implementation over to an AJP connection, since I no longer need to rewrite HTML through the proxy.
Nutch Site Search Implemented Commentary
