Lucene

Now that I've seen and used the power of Nutch, it's time for me to delve deeper into the capabilities of Lucene - which is the core search engine behind Nutch's web search layer.

The Lucene documentation is pretty dry, but it is still an important read.

Lucene is an Apache project, like the famous web server. Most of the Slashdot crowd seems well aware of its existence, but when I talk with colleagues, very few have heard of Lucene, and fewer still are considering implementing it. I suppose part of that comes from the general distaste for Java among most people I run into. I have a bit of a distaste myself, but frankly I haven't found anything open source that is nearly as powerful as Lucene, and believe me, I have looked. I'm no stranger to Java, so I can make the occasional change as necessary, but my goal is to use Lucene out of the box and identify the integration points that would let me use it as a search engine for non-Java projects.

Getting and Installing Java

I want to run the latest and greatest Lucene, since I'm looking ahead to future implementations. The Lucene FAQ says that Java 1.4 is required. This is where Java starts to frustrate me. I have a clean virtual machine, and because of Sun's licensing requirements, Java is not included as part of any distro. Past experience tells me I should be working with Sun's Java, for either compatibility or performance reasons. If I go to Java.com, I'm presented with one possible download, but I honestly don't know whether that is now the complete package or just the JRE (runtime environment), in which case I need to go to Sun.com to download the JDK. Of course, when I hit up Sun's website, there's not just the JRE and the JDK, but also Java SE, Java EE, and Java ME, with no reasonable way to discern what's what. Fine, I'll get through it - I'm motivated. But that means when I have an application that uses Lucene, I will have to walk my potential end users through Sun's website and the Java installation.

Eventually, I ended up here, which is the download link for the Java 6 JDK. Wait, didn't I say 1.4 is required? Isn't 6 WAY different from 1.4? Yup. Yay for Java once again. If I remember right (and don't quote me on this), Sun renumbered things because of all the confusion, so 6 is actually 1.6. That's still a bit off from the requirement, but the Lucene FAQ has problems too - it's on a wiki, and who knows when that particular entry was last updated. It doesn't specify whether the requirement is exactly 1.4 or at least 1.4. The one good thing about Java on Linux is that alternatives can be used to switch Java versions easily.

The process for Red Hat Enterprise Linux 5 / CentOS 5 is essentially to download Java, install the RPM, download a compatibility RPM from somewhere like here (which apparently sets up alternatives correctly), and then run

/usr/sbin/alternatives --config java

and select the appropriate version. Beware that it can take a few minutes for alternatives to set everything up correctly, so wait a bit before running

java -version

to see if everything works. Also, it's a good idea to set up your environment (the JDK's bin directory is what belongs on the PATH):

export JAVA_HOME=/usr/java/jdk-xxxxx
export JAVA_PATH=/usr/java/jdk-xxxxx
export PATH=$PATH:/usr/java/jdk-xxxxx/bin
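Since the 1.4 versus 6 numbering caused me some confusion, here is a minimal sketch of a check that maps whatever version string the java on the PATH reports to a single number. The function name java_major is my own invention, and the mapping assumes old-style strings of the form 1.N.x (so 1.6.0_06 is treated as Java 6); it is not something the Lucene or Sun documentation prescribes.

```shell
# Reduce a Java version string to the single "marketing" number.
# Assumption: old-style strings look like 1.N.x (1.6.0_06 means Java 6).
java_major() {
    v="$1"
    case "$v" in
        1.*) v="${v#1.}" ;;   # drop the leading "1." of the old scheme
    esac
    echo "${v%%[._-]*}"       # keep everything before the first separator
}

# Only query the JVM if one is actually on the PATH.
if command -v java >/dev/null 2>&1; then
    raw=$(java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
    echo "java on PATH reports $raw (Java $(java_major "$raw"))"
else
    echo "no java found on PATH"
fi
```

If the number printed is lower than expected, alternatives is probably still pointing at a different JVM than the one just installed.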

Getting and Installing Ant

Ant is a build tool like make, but it offers a few benefits that make doesn't: it's extensible, and it's cross-platform. Some Apache projects, like Lucene, also expect it to be there.

First - download Ant - http://ant.apache.org/bindownload.cgi

Now, extract it

tar -xvzf apache-ant-xxxx-bin.tar.gz

Set up your environment variables:

export PATH=$PATH:/install-dir/bin
export ANT_HOME=/install-dir

Startup files

Since we have a few environment variables at play here, it's a good idea to add them to your .bashrc, .cshrc, or whatever startup file your default shell uses. It's also not a bad idea to create a script that sets these environment variables directly, in case you are logged in as a different user or want to share the setup with somebody else.

Remember when setting up such a script: you can't just run it, because the variables would be set in a child shell that exits immediately - you have to source it. So instead of:

./javaenv

do

. ./javaenv
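The difference is easy to see in a throwaway experiment. This is only a sketch: DEMO_HOME and the temporary file are made-up stand-ins for the real javaenv script, used so that nothing in the actual environment is touched.

```shell
# Write a tiny environment script to a temporary file.
# DEMO_HOME is a hypothetical variable used only for this experiment.
envfile=$(mktemp)
echo 'DEMO_HOME=/tmp/demo; export DEMO_HOME' > "$envfile"

unset DEMO_HOME
sh "$envfile"                      # executed: runs in a child process
after_run=${DEMO_HOME:-unset}      # the current shell never saw the variable

. "$envfile"                       # sourced: runs in the current shell
after_source=${DEMO_HOME:-unset}   # now the variable is visible here

rm -f "$envfile"
echo "after running:  $after_run"
echo "after sourcing: $after_source"
```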

My javaenv file:

#!/bin/bash
# Source this file (". ./javaenv"); running it directly will not change the calling shell.
JAVA_HOME=/usr/java/jdk1.6.0_06
JAVA_PATH=/usr/java/jdk1.6.0_06
ANT_HOME=/home/myuser/ant/ant
# Only the bin directories need to be on the PATH.
PATH=$PATH:$JAVA_HOME/bin:$ANT_HOME/bin
export JAVA_HOME JAVA_PATH PATH ANT_HOME
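One side effect of a file like this is that every time it is sourced, the same directories get appended to PATH again. A possible guard, sketched here with a helper I am calling path_append (a hypothetical name, not a shell builtin), only adds a directory that is not already listed:

```shell
# Append a directory to PATH only if it is not already present.
# path_append is a hypothetical helper name, not part of any standard tool.
path_append() {
    case ":$PATH:" in
        *":$1:"*) ;;                # already listed, nothing to do
        *) PATH="$PATH:$1" ;;       # otherwise add it at the end
    esac
}

# Intended use inside a file like javaenv (paths are the ones from my setup):
#   path_append /usr/java/jdk1.6.0_06/bin
#   path_append /home/myuser/ant/ant/bin
#   export PATH
```

Wrapping PATH in colons before matching avoids false hits on directories that merely share a prefix with the one being added.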

Getting Lucene

Lucene can be downloaded or checked out. To check out the source, you have to use svn (for more information on svn, read the SVN Book).

mkdir ~/lucene
mkdir ~/lucene/svn
cd ~/lucene/svn
svn checkout http://svn.apache.org/repos/asf/lucene/java/trunk lucene/java/trunk

Building Lucene

I went the svn route, so I have to build Lucene.

cd /checkoutdir/lucene/java/trunk
ant war-demo
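After the build, it helps to confirm which jars were actually produced, since their exact names (including the version suffix) are needed later. A small sketch, assuming the jars are named lucene-*.jar and sit under the build directory as they did for me; list_lucene_jars is a made-up helper name:

```shell
# List Lucene jars found under a build directory, one per line, sorted.
# list_lucene_jars is a hypothetical helper; the lucene-*.jar pattern is an
# assumption based on the jar names produced by my build.
list_lucene_jars() {
    find "$1" -name 'lucene-*.jar' | sort
}

# Intended use (checkoutdir stands for wherever the source was checked out):
#   list_lucene_jars /checkoutdir/lucene/java/trunk/build
```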

The CLASSPATH variable

In the Ant documentation, there is a warning to eliminate the CLASSPATH variable. In the Lucene documentation, you need it. What gives?

Well, the CLASSPATH variable tells Java where to find things. When you are running Ant, everything it needs is in an easily discernible location, while people frequently set CLASSPATH to whatever their particular application requires, which is presumably why Ant warns against it. With Lucene, at least to get the basic demo working and to run the commands as specified in the demo documentation, you have to have it. So the story goes: when building with Ant, either run with the -noclasspath option or eliminate the environment variable. When running Lucene, either set the CLASSPATH environment variable appropriately, or pass the same value on the java command line with the -classpath option.
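One way to keep both tools happy in the same shell session is to hide CLASSPATH from Ant for just the one command. The following is a sketch of that idea; without_classpath is a name I made up, and the approach relies only on the fact that a subshell gets its own copy of the environment.

```shell
# Run a command with CLASSPATH removed, without touching the current shell.
# without_classpath is a hypothetical helper name.
without_classpath() {
    ( unset CLASSPATH; "$@" )   # the subshell's unset does not leak out
}

# Intended use:
#   without_classpath ant war-demo
# Afterwards CLASSPATH is still set in the calling shell, ready for Lucene.
```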

The CLASSPATH for Lucene needs to point to

lucene-core-{version}.jar

and for the demos:

lucene-demos-{version}.jar

so for me in this particular instance

export CLASSPATH=/checkoutdir/lucene/java/trunk/build/lucene-core-2.4-dev.jar:/checkoutdir/lucene/java/trunk/build/lucene-demos-2.4-dev.jar

where checkoutdir is the directory in which I checked out the svn version of Lucene. That's a lot to type, so I created a bash script containing that line and placed it in ~/lucyclasspath.
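A script like that might look roughly as follows. This is a sketch rather than my exact file: join_classpath and BUILD_DIR are illustrative names, and the jar names assume the 2.4-dev build from above. Like any script that sets environment variables, it has to be sourced to affect the current shell.

```shell
# Join all arguments with ':' to form a classpath string.
# join_classpath and BUILD_DIR are hypothetical names used for illustration.
join_classpath() {
    cp=""
    for jar in "$@"; do
        if [ -z "$cp" ]; then cp="$jar"; else cp="$cp:$jar"; fi
    done
    echo "$cp"
}

# checkoutdir stands for wherever the svn checkout lives.
BUILD_DIR=/checkoutdir/lucene/java/trunk/build
CLASSPATH=$(join_classpath \
    "$BUILD_DIR/lucene-core-2.4-dev.jar" \
    "$BUILD_DIR/lucene-demos-2.4-dev.jar")
export CLASSPATH

# Alternative without the environment variable:
#   java -classpath "$CLASSPATH" org.apache.lucene.demo.SearchFiles
```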

Running the demos

I created a demo folder under ~/

mkdir ~/demo
cd ~/demo

Build an index

I set the CLASSPATH environment variable above, so my command to index all the Lucene source code is as follows:

java org.apache.lucene.demo.IndexFiles {full-path-to-source-directory}

or

java org.apache.lucene.demo.IndexFiles /home/myuser/lucene/svn/lucene/java/trunk/src

The total index build on my virtual machine took a little over 9 seconds.

Essentially, what happened is that an index folder was created inside my ~/demo directory. That folder contains the index for all of the documents I indexed. The Lucene documentation on file formats actually spells out the contents of this folder with surprising clarity.

Search an index

So I've built an index, now I want to search it. The demo has a wrapper around the search functionality that does query prompting and a basic display of information about the files:

java org.apache.lucene.demo.SearchFiles

Once you run that, you will be prompted with:

Enter query:

Type in a word (for example: query) or a term, and you will be presented with a top 10 list of matches (or no matches, as the case may be). If there are more than 10 matches, you can page through them, jump to a certain page, or opt to quit.

Moving On

At this point, I've verified Lucene's functionality, so now it's time to delve a little deeper. There is a basic web demo that follows the same path, but I'm not that interested in it at the moment. I'm more interested in building a more structured index so that I can integrate this search functionality into other applications. For an easier go at web site searching, Nutch is already a great solution.

I did find a solid article that explains the process of building documents and indexes in a little more detail than I can find on the Apache site: Integrate advanced search functionalities, at JavaWorld. From what I've gleaned, it appears that the document structure you're using must be specified in Java code and cannot be arbitrarily defined. That doesn't look like too much of a problem, as the code is fairly simple and looks like it can be generated easily.

I also found a few products that are of interest at this point: DBSight and Hibernate. Both use Lucene as the underlying search engine and provide a layer of abstraction on top of it, presumably to offer an easier path to indexing and searching complex objects. DBSight has both free and paid versions, and it looks like the major differences between the two are support and remote replication of indexes. Hibernate has won a few awards, but on initial review it also seems like a very large and complex product to work with. There's also Compass, but Compass seems as light on documentation as Lucene.

Another Lucene-related product that looks interesting is Carrot2. Carrot2 provides results clustering, which I know little about, but the concept is very interesting. My initial reaction is that this is yet another solid project with no documentation targeting people who might actually want to use it.
