Old 02-07-2007, 06:45 PM   #1
Movable Type Integration
 
Join Date: Feb 2007
Posts: 265
MT Integration is on a distinguished road
Discussion: Nutch as a site search tool

Nutch as a site search tool
Quote:
My own experiences with implementing Nutch as a site search utility. The product is very effective, but it was a bit problematic to implement.
__________________
Your friendly neighborhood automation routine.

Last edited by MT Integration : 08-16-2007 at 02:44 AM. Reason: Update to original article
Old 02-08-2007, 01:47 AM   #2
Runs This Show
 
Join Date: Dec 2006
Recent Blog: Where to Go From Here
Posts: 183
Steve has disabled reputation
Re: Discussion: Nutch as a site search tool

After playing around with this, I've found that the internationalization properties of Nutch are a huge PITA, and getting rid of them is not easy. What I thought was a decent reverse proxy setup doesn't work well in IE at all.

In the best case scenario, I'm going to have to hard-code links into the pages, because IE7 does not play well with Nutch's internationalization - or a few other features, for that matter.

It makes me think back to the Solr approach, where they give you a few response formats and it's up to you to come up with a presentation layer. That doesn't sound like such a bad idea any more.
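
To make the "response format plus your own presentation layer" idea concrete, here is a minimal sketch in Python. It assumes a response shaped like the output of Solr's JSON response writer (a `response` object with `numFound`, `start` and `docs`). The sample data, the `title` and `url` field names and the `render_results` helper are all hypothetical, since the real fields depend on your own schema.

```python
import json
from html import escape

# Hypothetical sample shaped like Solr's JSON response writer output.
# The "title" and "url" fields are assumptions; a real schema defines its own.
SAMPLE_RESPONSE = """
{
  "responseHeader": {"status": 0, "QTime": 3},
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [
      {"title": "Nutch as a site search tool", "url": "http://example.com/nutch"},
      {"title": "Solr <notes>", "url": "http://example.com/solr?a=1&b=2"}
    ]
  }
}
"""


def render_results(raw_json):
    """Turn a Solr-style JSON response into a small HTML fragment."""
    data = json.loads(raw_json)
    response = data["response"]
    lines = ["<p>%d results</p>" % response["numFound"], "<ul>"]
    for doc in response["docs"]:
        # Escape everything that came from the index before putting it in HTML.
        title = escape(doc.get("title", "(untitled)"))
        url = escape(doc.get("url", "#"), quote=True)
        lines.append('  <li><a href="%s">%s</a></li>' % (url, title))
    lines.append("</ul>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_results(SAMPLE_RESPONSE))
```

The point of the split is that the markup, links and language of the results page live entirely in your own code, so browser quirks can be handled there instead of inside the search tool.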
My URL: http://www.stevekallestad.com
Old 05-06-2008, 10:46 PM   #3
Junior Member
 
Join Date: May 2008
Posts: 1
jaya is on a distinguished road
Re: Discussion: Nutch as a site search tool

Hello

We are trying to implement Nutch in our tool. We have successfully crawled and indexed our local drive for the required contents.
Now we have no clue about the implementation part. Our tool is developed in C and C++. I was able to download the sample Nutch API, but I have no idea how to use it in our implementation, nor do I understand the classes. Any help related to Nutch API implementation would be extremely beneficial for me.

Thanks in advance for the help.

Jaya
Old 05-07-2008, 07:26 AM   #4
Runs This Show
 
Join Date: Dec 2006
Recent Blog: Where to Go From Here
Posts: 183
Steve has disabled reputation
Re: Discussion: Nutch as a site search tool

Frankly, I'm in the dark as far as the Nutch API is concerned - I haven't played around with the tool a whole heck of a lot recently.

It sounds like you are trying to use Nutch as a search facility for an application - but everything I've seen suggests that Nutch is geared specifically towards web searching.

Lucene is the search engine behind Nutch. It's difficult to tell the two apart when you first look at things, but consider Lucene the search tool and Nutch the presentation layer: Nutch provides you with the search page and the search results page. Actually, I've oversimplified, because Nutch also provides the web crawling engine and the Lucene configuration as well.

As far as Lucene is concerned, Lucene holds the search indexes. It's essentially a very simple database - the fields in a Lucene document are like columns. You can gear searches to weigh certain fields more than others - like favoring h1 elements and title elements over the other textual content in a web page.
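
The weighting idea can be illustrated with a toy sketch in Python. This is not Lucene's API and not its scoring formula. It only shows how a match in a heavily weighted part of a page (title, h1) can outrank several matches in the body text. The weights, the sample documents and the `score` and `search` helpers are all made up for the example.

```python
# Toy illustration of field weighting. This is NOT Lucene's API or its
# scoring formula; the field names and weights are made up for the example.
FIELD_WEIGHTS = {"title": 3.0, "h1": 2.0, "body": 1.0}

DOCS = [
    {"id": "a", "title": "Nutch site search", "h1": "Setup",
     "body": "crawl and index pages"},
    {"id": "b", "title": "Release notes", "h1": "Changes",
     "body": "nutch nutch search fixes"},
]


def score(doc, term):
    """Sum weighted counts of `term` across the weighted fields of `doc`."""
    term = term.lower()
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = doc.get(field, "").lower().split()
        total += weight * words.count(term)
    return total


def search(docs, term):
    """Return (id, score) pairs for matching docs, best score first."""
    scored = [(doc["id"], score(doc, term)) for doc in docs]
    hits = [pair for pair in scored if pair[1] > 0]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    print(search(DOCS, "nutch"))
```

Here document "a" mentions the term once in its title and document "b" mentions it twice in its body, yet "a" ranks first because the title carries three times the weight.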

There is a C API for Lucene, but you'll have to read a bit about how Lucene works to understand the API. Personally, I use the Java version because it is better maintained, and I just proxy my search requests through Java. Last I looked, the C API was a couple of versions out of date and fairly unmaintained. Same with the Perl one.

I would start out by reading the concepts page. Then I'd take a look at the FAQ. From there, take a look through the various command line utilities - they are underdocumented, so take some time looking at each one and then post your questions to the list. There's also Lucli, which is another set of command line utilities - undocumented, but there is command line help.