Introduction to Lucene

I'm a Java developer and my IDE of choice has been Eclipse for a while now. I've always been fascinated by what Eclipse does in the background, and in particular by the search capabilities it provides to its users. Digging a bit deeper, I found that the Eclipse search makes use of Apache Lucene (http://lucene.apache.org/core/). This made for great reading and I wanted to share a few concepts that I learned along the way. I've also included a sample project with this post, which searches for a particular word in a set of files.

What is Lucene?

Lucene is a Java-based, full-text search library which is not only highly performant but also runs with a small memory footprint. It also scales quite well. Developers can embed this library in their applications and enjoy the search power that it provides.

Introduction to terms

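Index - The structure that Lucene searches: an inverted index that maps each term to the documents containing it.

Document - A set of searchable fields which form the basis for indexing and search. API ref - http://lucene.apache.org/core/3_0_3/api/all/org/apache/lucene/document/Document.html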

IndexWriter - Used to create and maintain an index. Its API can be found at http://lucene.apache.org/core/3_0_3/api/all/org/apache/lucene/index/IndexWriter.html.

IndexReader - An abstract class to access and read an index. API ref - http://lucene.apache.org/core/3_0_3/api/all/org/apache/lucene/index/IndexReader.html

FSDirectory - The base class for Directory implementations that store index files in the file system. API ref - http://lucene.apache.org/core/3_0_3/api/all/org/apache/lucene/store/FSDirectory.html

Analyzer - Analyzes text by breaking it into tokens. API ref - http://lucene.apache.org/core/3_0_3/api/all/org/apache/lucene/analysis/Analyzer.html
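
To make these terms a little more concrete, here is a toy sketch in plain Java of the idea behind an analyzer and an index. It is not how Lucene implements either of them; the class name ToyInvertedIndex and both method names are made up for this illustration. The tokenize method plays the role of a very naive analyzer, and build produces a map from each token to the documents that contain it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class ToyInvertedIndex {

    // Hypothetical stand-in for an Analyzer: lower-case the text, then split on
    // anything that is not a letter or a digit.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Hypothetical stand-in for an index: maps each token to the names of the
    // documents that contain it.
    public static Map<String, Set<String>> build(Map<String, String> documents) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            for (String token : tokenize(doc.getValue())) {
                index.computeIfAbsent(token, k -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> documents = new LinkedHashMap<>();
        documents.put("a.txt", "Apple pie and apple juice");
        documents.put("b.txt", "Orange juice");

        Map<String, Set<String>> index = build(documents);
        System.out.println("apple -> " + index.get("apple")); // prints "apple -> [a.txt]"
        System.out.println("juice -> " + index.get("juice")); // prints "juice -> [a.txt, b.txt]"
    }
}
```

Looking up a word is then a single map lookup rather than a scan of every file, which is the main reason a search library builds an index up front.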

Diagram

Simple Search example

The scope of this sample project is to demonstrate basic search using Apache Lucene. It barely scratches the surface, and I will use future posts to demonstrate some of the more complex and more interesting concepts this library offers.

I have also included a sample word list, which we shall use to perform our searches. This can be downloaded here

1. Create a basic Maven project and name it 'Lucene'. Once this step is completed, the new project should appear in your IDE.

2. The next step is to add the Lucene Maven dependencies to your pom file.
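
As a rough guide, the dependencies section could look like the fragment below. The version 4.7.0 is an assumption, picked only because the code in this post uses Version.LUCENE_47, and the list of artifacts is inferred from the imports used later (the core library, StandardAnalyzer from the common analyzers module, and the classic query parser). Adjust both to suit your project.

```xml
<!-- Assumed version: 4.7.0, to match Version.LUCENE_47 in the Java sources -->
<dependencies>
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-core</artifactId>
        <version>4.7.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-analyzers-common</artifactId>
        <version>4.7.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-queryparser</artifactId>
        <version>4.7.0</version>
    </dependency>
</dependencies>
```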

3. The next thing we need is a sample dictionary file, which can be downloaded here.
Unzip this file and place the folder in ${project_home}/src/main/resources

4. Now on to the actual Java sources. The first step is to create an index of the dictionary files using an IndexWriter. To do this, we create a class which accepts the data directory on the file system and reads it recursively, picking up any files with the extension '.txt'.

	package com.gb.search.lucene;

	import java.io.File;
	import java.io.FileReader;
	import java.io.IOException;

	import org.apache.lucene.analysis.standard.StandardAnalyzer;
	import org.apache.lucene.document.Document;
	import org.apache.lucene.document.Field;
	import org.apache.lucene.index.IndexWriter;
	import org.apache.lucene.index.IndexWriterConfig;
	import org.apache.lucene.store.FSDirectory;
	import org.apache.lucene.util.Version;

	public class SimpleFileIndexer {

	public int index(File indexDir, File dataDir, String suffix) throws IOException {

	IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_47, new StandardAnalyzer(Version.LUCENE_47));
	IndexWriter indexWriter = new IndexWriter(FSDirectory.open(indexDir), config);
	int fileCtr = 0;
	try {
	// clear out all documents from the index
	indexWriter.deleteAll();

	// perform indexing
	indexDirectory(indexWriter, dataDir, suffix);

	// commit your changes here
	indexWriter.commit();
	// read the document count while the writer is still open
	fileCtr = indexWriter.numDocs();
	} catch (Exception e) {
	// in case of an exception, roll back any changes (rollback also closes the writer)
	indexWriter.rollback();
	} finally {
	// NOTE - Don't forget to close the index writer
	indexWriter.close();
	}
	return fileCtr;
	}

	/**
	* Recursively index files in a directory.
	* @param indexWriter
	* @param dataDir
	* @param suffix
	* @throws IOException
	*/
	private void indexDirectory(IndexWriter indexWriter, File dataDir,
	String suffix) throws IOException {
	File[] files = dataDir.listFiles();
	for (File file : files) {
	if (file.isDirectory()) {
	indexDirectory(indexWriter, file, suffix);
	} else {
	indexFileWithIndexWriter(indexWriter, file, suffix);
	}
	}
	}

	@SuppressWarnings("deprecation")
	private void indexFileWithIndexWriter(IndexWriter indexWriter, File file,
	String suffix) throws IOException {
	if (file.isHidden() || file.isDirectory() || !file.canRead() || !file.exists()) {
	return;
	}
	if (suffix != null && !file.getName().endsWith(suffix)) {
	return;
	}
	System.out.println("Indexing file : " + file.getCanonicalPath());
	Document document = new Document();
	document.add(new Field("contents", new FileReader(file)));
	document.add(new Field("fileName", file.getCanonicalPath(), Field.Store.YES, Field.Index.ANALYZED));

	indexWriter.addDocument(document);
	}
	}


The above code does a few things. As a first step, it declares a config object which also has a reference to an Analyzer. A new index directory is created based on the directory parameter. The config and the index directory are then used to create an IndexWriter.

We start off by clearing all documents held in the index and then add files recursively from the directory specified. Once this is done, we commit the changes and close the IndexWriter in a finally block.
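
The recursive walk is independent of Lucene, so it can be tried out on its own. Below is a small sketch of that step using only java.io.File. The class name SuffixFileCollector and its collect method are made up for this illustration and are not part of the sample project. It also guards against listFiles() returning null, which happens when the path is not a directory or cannot be read.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class SuffixFileCollector {

    // Recursively collects readable, non-hidden files whose names end with the
    // given suffix. A null suffix accepts every file.
    public static List<File> collect(File dir, String suffix) {
        List<File> matches = new ArrayList<>();
        File[] entries = dir.listFiles();
        if (entries == null) {
            // listFiles() returns null if dir is not a directory or cannot be read
            return matches;
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                matches.addAll(collect(entry, suffix));
            } else if (!entry.isHidden() && entry.canRead()
                    && (suffix == null || entry.getName().endsWith(suffix))) {
                matches.add(entry);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Uses the first argument as the data directory, or the current directory
        File dataDir = new File(args.length > 0 ? args[0] : ".");
        for (File file : collect(dataDir, ".txt")) {
            System.out.println(file.getPath());
        }
    }
}
```

Note that the sketch passes ".txt" with the leading dot. A bare "txt" suffix would also accept a file named, say, "notestxt".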

	package com.gb.search.lucene;

	import java.io.File;
	import java.io.IOException;
	import java.net.URL;

	public class IndexWriteTester {

	public static void main(String[] args) throws IOException {

	URL dictUrl = IndexWriteTester.class.getResource("/12dicts-5.0");
	File indexDir = new File("lucene-index");
	System.out.println(indexDir.getAbsolutePath());
	File dataDir = new File(dictUrl.getFile());
	String suffix = "txt";

	SimpleFileIndexer indexer = new SimpleFileIndexer();

	int numIndex = indexer.index(indexDir, dataDir, suffix);
	System.out.println("Indexed " + numIndex + " files");
	}

	}


Running IndexWriteTester prints the list of files that were indexed, followed by a count of the indexed files, on the Eclipse console.

	package com.gb.search.lucene;

	import java.io.BufferedReader;
	import java.io.File;
	import java.io.FileReader;
	import java.io.IOException;
	import java.io.LineNumberReader;

	import org.apache.lucene.analysis.Analyzer;
	import org.apache.lucene.analysis.standard.StandardAnalyzer;
	import org.apache.lucene.document.Document;
	import org.apache.lucene.index.IndexReader;
	import org.apache.lucene.queryparser.classic.ParseException;
	import org.apache.lucene.queryparser.classic.QueryParser;
	import org.apache.lucene.search.IndexSearcher;
	import org.apache.lucene.search.Query;
	import org.apache.lucene.search.ScoreDoc;
	import org.apache.lucene.search.TopDocs;
	import org.apache.lucene.store.Directory;
	import org.apache.lucene.store.FSDirectory;
	import org.apache.lucene.util.Version;

	public class SimpleFileSearcher {

	public void searchIndex(File indexDir, String queryStr, int maxHits) throws IOException, ParseException {
	IndexReader indexReader = null;
	try {
	Directory directory = FSDirectory.open(indexDir);
	indexReader = IndexReader.open(directory);
	IndexSearcher indexSearcher = new IndexSearcher(indexReader);
	Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_47);
	QueryParser queryParser = new QueryParser(Version.LUCENE_47, "contents", analyzer);
	Query query = queryParser.parse(queryStr);
	TopDocs topDocs = indexSearcher.search(query, maxHits);

	ScoreDoc[] hits = topDocs.scoreDocs;
	int totalHits = topDocs.totalHits;

	for (ScoreDoc doc : hits) {
	int docId = doc.doc;
	Document document = indexSearcher.doc(docId);
	String fileName = document.get("fileName");
	File file = new File(fileName);
	System.out.println(fileName);
	// open a fresh reader for each matching file and close it once the file is scanned
	LineNumberReader lineNumberReader = new LineNumberReader(new BufferedReader(new FileReader(file)));
	try {
	String currentLine;
	// start at the first line; an extra readLine() before the loop would skip it
	while ((currentLine = lineNumberReader.readLine()) != null) {
	if (currentLine.contains(queryStr)) {
	System.out.println("\tMatch " + queryStr + " found in line " + lineNumberReader.getLineNumber() + " in file " + fileName + " which looks like " + currentLine);
	}
	}
	} finally {
	lineNumberReader.close();
	}
	}

	System.out.println("Found " + totalHits);
	} catch (Exception e) {
	e.printStackTrace();
	} finally {
	// the index reader is still null if opening the index failed
	if (indexReader != null) {
	indexReader.close();
	}
	}
	}

	}
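
The searcher works in two passes: Lucene finds the files that match, and a plain line-by-line scan then reports where in each file the word occurs. The second pass can be sketched on its own with just the JDK. The class name LineMatchScanner and its matchingLines method are made up for this illustration and are not part of the sample project.

```java
import java.io.IOException;
import java.io.LineNumberReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class LineMatchScanner {

    // Returns the 1-based numbers of the lines that contain the given word.
    public static List<Integer> matchingLines(Reader source, String word) throws IOException {
        List<Integer> lineNumbers = new ArrayList<>();
        try (LineNumberReader reader = new LineNumberReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains(word)) {
                    // getLineNumber() counts the lines read so far, so it is the
                    // 1-based number of the line just returned
                    lineNumbers.add(reader.getLineNumber());
                }
            }
        }
        return lineNumbers;
    }

    public static void main(String[] args) throws IOException {
        String text = "apple\nbanana\npineapple\n";
        System.out.println(matchingLines(new StringReader(text), "apple")); // prints "[1, 3]"
    }
}
```

Because this pass is a plain substring check, "apple" also matches "pineapple", whereas the Lucene query matches on whole tokens produced by the analyzer. The two passes can therefore disagree about what counts as a match.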


	package com.gb.search.lucene;

	import java.io.File;
	import java.io.IOException;

	import org.apache.lucene.queryparser.classic.ParseException;

	public class IndexReadTester {

	public static void main(String[] args) throws IOException, ParseException {
	File indexDir = new File("lucene-index");
	String queryStr = "apple";
	int maxHits = 100;
	SimpleFileSearcher simpleSearcher = new SimpleFileSearcher();
	simpleSearcher.searchIndex(indexDir, queryStr, maxHits);

	}

	}


Conclusion


Monday, 6 October 2014
