Monday, 6 October 2014

Introduction to Apache Lucene.



I'm a Java developer, and Eclipse has been my IDE of choice for a while now. I've always been fascinated by what Eclipse does in the background, in particular the search capabilities it provides to its users. Digging a bit more, I found that Eclipse search is enhanced by Apache Lucene (http://lucene.apache.org/core/). It made for great reading, and I wanted to share a few concepts that I learned along the way. I've also added a sample project to this post, which searches for a particular 'word' in a set of files.

What is Lucene? 

Lucene is a Java-based, full-text search library that is not only highly performant but also has a small RAM footprint. It also scales quite well. Developers can embed this library in their applications and enjoy the search power that it provides.


Introduction to terms

Index
Document - A set of searchable fields which form the basis for indexing and search. API ref - http://lucene.apache.org/core/3_0_3/api/all/org/apache/lucene/document/Document.html

IndexWriter - Used to create and maintain an index. Its API can be found at http://lucene.apache.org/core/3_0_3/api/all/org/apache/lucene/index/IndexWriter.html.

IndexReader - An abstract class used to access and read an index. API ref - http://lucene.apache.org/core/3_0_3/api/all/org/apache/lucene/index/IndexReader.html

FSDirectory - The base class for Directory implementations that store index files on the file system. API ref - http://lucene.apache.org/core/3_0_3/api/all/org/apache/lucene/store/FSDirectory.html

Analyzer - Analyzes text by breaking it into tokens. API ref - http://lucene.apache.org/core/3_0_3/api/all/org/apache/lucene/analysis/Analyzer.html
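
To make these terms a little more concrete, here is a toy sketch in plain Java. It is not Lucene's API and says nothing about how Lucene is implemented internally; the class ToyIndex and its methods are hypothetical names made up for illustration. The analyze method stands in for an Analyzer, addDocument for what an IndexWriter does, and search for a lookup against the index.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index. NOT Lucene's API: every name here is hypothetical.
class ToyIndex {
    // Maps each token to the (sorted) names of the documents containing it.
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Stand-in for an Analyzer: lower-case the text and split it into tokens
    // on anything that is not a letter or a digit.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Stand-in for adding a document through an IndexWriter.
    void addDocument(String name, String contents) {
        for (String token : analyze(contents)) {
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(name);
        }
    }

    // Returns the names of the documents that contain the given word.
    List<String> search(String word) {
        List<String> terms = analyze(word);
        if (terms.isEmpty()) {
            return Collections.emptyList();
        }
        return new ArrayList<>(postings.getOrDefault(terms.get(0), Collections.emptySet()));
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addDocument("a.txt", "Apple, banana and cherry");
        index.addDocument("b.txt", "Banana bread");
        System.out.println(index.search("banana")); // prints [a.txt, b.txt]
    }
}
```

The point of the sketch is only the shape of the workflow: text is analyzed into tokens once at indexing time, so that a later search becomes a cheap lookup rather than a scan of every file.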

Diagram



Simple Search example

The scope of this sample project is to demonstrate basic search capabilities using Apache Lucene. It barely scratches the surface, and I will use future posts to demonstrate some of the more complex and more interesting concepts this library offers.

I have also included a sample word list, which we shall use to perform our searches. This can be downloaded here.

1. Create a basic Maven project and name it 'Lucene'. Once this step is completed, your IDE should look like this:

2. The next step is to add the Lucene Maven dependencies to your pom file:




3. The next thing we need is a sample dictionary file, which can be downloaded here.
     Unzip this file and place the folder in ${project_home}/src/main/resources

4. Now onto the actual Java sources. The first step is to create an index of the dictionary files using an IndexWriter. To do this, we create a class that accepts the data directory on the file system and reads it recursively for any files with a '.txt' extension.
 


The indexer class does a few things. As a first step, it declares a config object that holds a reference to an Analyzer. A new index directory is created based on the directory parameter. The config and the index directory are then used to create an IndexWriter.

We start by clearing all documents held in the IndexWriter and then add files recursively from the specified directory. Once this is done, we commit the index write and close the writer in a finally block.
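
The recursive file collection in step 4 can be sketched on its own in plain Java, without any Lucene classes. This is only a minimal sketch of that one sub-step, not the indexer from the sample project: the class name TxtFileCollector and the default path in main are assumptions made for illustration, and each path it returns would still have to be turned into a Document and handed to the IndexWriter.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: finds the '.txt' files an indexer would process.
class TxtFileCollector {

    // Walks dataDir recursively and returns every regular file whose name
    // ends with ".txt", sorted by path.
    static List<Path> collect(Path dataDir) throws IOException {
        // Files.walk keeps directory handles open, so close the stream when done.
        try (Stream<Path> paths = Files.walk(dataDir)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(".txt"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed default location; pass a different directory as the first argument.
        Path dataDir = Paths.get(args.length > 0 ? args[0] : "src/main/resources");
        if (!Files.isDirectory(dataDir)) {
            System.out.println("Not a directory: " + dataDir);
            return;
        }
        List<Path> files = collect(dataDir);
        files.forEach(System.out::println);
        System.out.println("Files found: " + files.size());
    }
}
```

Keeping the file walk separate from the indexing makes it easy to check which files will be picked up before any index is written.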





Once this is done, we can see a list of files that were indexed and also a count of the number of indexed files on the Eclipse console.








Conclusion