| | #1 | |||||||
| Movable Type Integration Join Date: Feb 2007
Posts: 265
| Nutch as a site search tool Quote:
__________________ Your friendly neighborhood automation routine. Last edited by MT Integration : 08-16-2007 at 02:44 AM. Reason: Update to original article | |||||||
| | |
| | #2 | ||
| Runs This Show | After playing around with this, I've found the internationalization properties of Nutch to be a huge PITA, and getting rid of them is none too easy. What I thought was a decent implementation of reverse proxy-ability doesn't work very well in IE at all. In the best-case scenario, I'm going to have to hard-code links into the pages, because IE7 does not play well with Nutch's internationalization - or a few other features, for that matter. It makes me think of the Solr implementation, where they give you a few response formats and it's up to you to come up with a presentation layer - that doesn't sound like such a bad idea any more. My URL: http://www.stevekallestad.com | ||
| | |
| | #3 | ||
| Junior Member Join Date: May 2008
Posts: 1
| Hello. We are trying to implement Nutch in our tool. We have successfully crawled and indexed our local drive for the required content, but we have no clue how to proceed with the implementation. Our tool is developed in C and C++. I downloaded the sample Nutch API, but I don't understand the classes or how to use them. Any help with implementing the Nutch API would be extremely beneficial. Thanks in advance for the help. Jaya | ||
| | |
| | #4 | ||
| Runs This Show | Frankly, I'm in the dark as far as the Nutch API is concerned - I haven't played around with the tool a whole heck of a lot recently. It sounds like you are trying to use Nutch as a search facility for an application, but everything I've seen suggests that Nutch is geared specifically towards web searching. Lucene is the search engine behind Nutch. It's difficult to understand the difference between the two when first looking at things, but consider Lucene the search tool and Nutch the presentation layer: Nutch provides you with the search page and the search results page. Actually, I've oversimplified, because Nutch also provides the web crawling engine and the Lucene configuration as well. Lucene itself holds the search indexes. It's essentially a very simple database - the fields in Lucene are like columns. You can gear searches to weigh certain fields more than others, like favoring h1 elements and title elements over the other textual content in a web page. There is a C API for Lucene, but you'll have to read a bit about how Lucene works to understand the API. Personally, I use the Java version because it is better maintained, and I just proxy my search requests through Java. The C API, last I looked, was a couple of versions out of date and fairly unmaintained. Same with Perl. I would start out by reading the concepts page. Then I'd take a look at the FAQ. From there, take a look through the various command line utilities - they are underdocumented, so take some time looking at each one and then push out questions to the list. There's also Lucli, which is another set of command line utilities - undocumented, but there is command line help. | ||
| | |
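
To make the idea of favoring title and h1 text over the rest of a page a bit more concrete, here is a toy sketch in plain Python. It is not Lucene's or Nutch's API: the field names (`title`, `h1`, `body`), the weights, the sample documents, and the helpers `tokenize`, `score`, and `search` are all made up for illustration, and real Lucene scoring is considerably more involved.

```python
from collections import Counter

# Hypothetical per-field weights: a match in the title or h1 counts for more
# than a match in the body text. These numbers are assumptions, not defaults
# taken from Lucene or Nutch.
FIELD_WEIGHTS = {"title": 3.0, "h1": 2.0, "body": 1.0}


def tokenize(text):
    """Lowercase and split on whitespace; a real analyzer does much more."""
    return text.lower().split()


def score(doc, query, weights=FIELD_WEIGHTS):
    """Sum the weighted counts of each query term across the document's fields."""
    terms = tokenize(query)
    total = 0.0
    for field, weight in weights.items():
        counts = Counter(tokenize(doc.get(field, "")))
        total += weight * sum(counts[term] for term in terms)
    return total


def search(docs, query, weights=FIELD_WEIGHTS):
    """Return (score, doc_id) pairs for matching documents, best score first."""
    scored = [(score(doc, query, weights), doc_id) for doc_id, doc in docs.items()]
    matches = [pair for pair in scored if pair[0] > 0]
    return sorted(matches, key=lambda pair: (-pair[0], pair[1]))


# Two made-up documents, each stored as a small set of named fields.
DOCS = {
    "a": {"title": "Nutch tutorial", "h1": "Getting started", "body": "crawl and index pages"},
    "b": {"title": "Release notes", "h1": "Changes", "body": "nutch nutch bug fixes"},
}

if __name__ == "__main__":
    for doc_score, doc_id in search(DOCS, "nutch"):
        print(doc_id, doc_score)
```

With these assumed weights, a single title match in document `a` outranks two body matches in document `b`; if every field is given the same weight, the order flips.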