Nutch and SEO and What it Means to You


Obviously, I've been playing around with Nutch quite a bit lately, and while I can't say that I'm the biggest fan of the implementation, it is very useful. Knowledge of Nutch and its inner workings could also become quite a bit more important for webmasters and SEOs in the near future.

We all know that the Wikipedia guy is launching a new search engine. That search engine is going to be based on Nutch - and while they probably won't be using the scoring rules as released, they will for the most part have to play within the realm of Nutch and Lucene. WikiSearch will also be under the influence of the default settings for obscure Nutch features, since those are not likely to be changed in the near term.

With Nutch getting closer and closer to a 1.0 release, and with the general public becoming more and more aware of what Nutch is and how to use it, I imagine that Nutch-based search engines will become more commonplace - whether in the vertical search arena, personalized search, site search, or with a soon-to-be big boy like WikiSearch, Wikiasari, or whatever it is going to be called.

It is worth it for webmasters to pay attention to the Nutch and Lucene mailing lists to get a feel for the idiosyncrasies of the engine itself as well as for commonly recommended default settings. While optimizing for Nutch isn't something that is going to take over the SEO industry, it is far from a bad idea as long as you don't end up fouling up your optimizations for the big 3.


Google's webmaster guidelines point out that you shouldn't have more than 150 links on any given page. SEOs have known for quite some time that Google will read more than 150 links on a given page, but Nutch will actually only read 100 anchors from a page by default. This means that burying a link deep on a sitemap page, or placing all of your navigational elements at the bottom of a document and using CSS to reposition them, are both optimization areas that need to be addressed.
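To get a rough feel for which links on a page would fall past that cutoff, here is a minimal Python sketch. It assumes the limit is applied to anchors in source order (not the order CSS displays them in) and uses the 100 figure quoted above; in Nutch's configuration this appears to correspond to the `db.max.outlinks.per.page` property, but check your own install. The names `AnchorCollector` and `split_by_limit` are made up for illustration and are not part of Nutch.

```python
from html.parser import HTMLParser

# Default quoted in this post; an assumption, not read from a Nutch install.
ASSUMED_LINK_LIMIT = 100


class AnchorCollector(HTMLParser):
    """Collects href values of <a> tags in document (source) order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def split_by_limit(html, limit=ASSUMED_LINK_LIMIT):
    """Return (links within the limit, links past the limit)."""
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs[:limit], collector.hrefs[limit:]


# Example: a page with 120 links, the last 20 of which would be skipped.
page = "".join(f'<a href="/p{i}">p{i}</a>' for i in range(120))
kept, dropped = split_by_limit(page)
print(len(kept), len(dropped))  # prints: 100 20
```

Running something like this against a sitemap page or a template with footer navigation shows quickly whether the links you care about sit inside the first 100 anchors in the markup.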

One thing that I see Nutch doing right now - and I don't know that there is a planned long term fix for it - is boosting the base score of pages every time they are indexed, so frequently updated pages have an advantage over pages that are crawled less frequently. It seems to me that Microsoft's engine has the same sort of "feature".
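As a toy illustration of why that matters, the sketch below models the effect with a fixed multiplier applied on every index pass. This is not Nutch's actual scoring code: the multiplicative form, the 1.1 factor, and the function name `simulated_score` are all assumptions chosen only to show how re-index frequency could compound.

```python
def simulated_score(base_score, index_passes, boost_per_pass=1.1):
    """Toy model: the score is multiplied by a fixed factor on each index pass.

    boost_per_pass is a hypothetical value, not a Nutch default.
    """
    return base_score * boost_per_pass ** index_passes


# Two pages with the same starting score over the same period:
# one re-indexed 30 times, the other only once.
frequent = simulated_score(1.0, index_passes=30)
rare = simulated_score(1.0, index_passes=1)
print(f"frequent: {frequent:.2f}, rare: {rare:.2f}")
```

Under these assumptions the frequently indexed page pulls far ahead without any change in its content or links, which is the advantage described above.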

Nutch also heavily favors anchor text, like most modern search engines - and the more links the better. Starting to sound familiar?

What I don't see - or at least it hasn't become very apparent - is the ability to penalize domains, pages, or the scoring elements coming out of bad domains, which means that spamming techniques will pretty much own this search engine structure until it comes of age. (Again, much like the MS engine... could they be using the software from this project?)

Some good information can be had by looking through the code for the page parsers too. You can glean information not just about HTML documents, but about Office documents and PowerPoint as well. Certainly opportunity exists to optimize less popular document formats.


