I'm a Java developer and my IDE of choice has been Eclipse for a while now. I've always been fascinated with the things Eclipse supposedly does in the background and in particular the search functions/capabilities it provides to its users. Digging a bit more, I found that the Eclipse search was enhanced by the use of Apache Lucene (http://lucene.apache.org/core/). This made great reading and I wanted to share a few concepts that I learned along the way. I've also added a sample project with this blog which will search for a particular 'word' in a set of files.
What is Lucene?
Lucene is a Java-based, full-text search library that is fast, works off a small RAM footprint and scales well. Developers can embed the library in their applications and use the search power it provides.
Introduction to terms
Index - The data structure Lucene builds from your documents so that a search can look up terms directly instead of scanning the original files.
Document - A set of searchable fields which form the basis for indexing and search. API ref - http://lucene.apache.org/core/3_0_3/api/all/org/apache/lucene/document/Document.html
IndexWriter - Used to create and maintain an index. API ref - http://lucene.apache.org/core/3_0_3/api/all/org/apache/lucene/index/IndexWriter.html
IndexReader - An abstract class to access and read an index. API ref - http://lucene.apache.org/core/3_0_3/api/all/org/apache/lucene/index/IndexReader.html
FSDirectory - A base class for Directory implementations that store index files on the file system. API ref - http://lucene.apache.org/core/3_0_3/api/all/org/apache/lucene/store/FSDirectory.html
Analyzer - Analyzes text by breaking it into tokens. API ref - http://lucene.apache.org/core/3_0_3/api/all/org/apache/lucene/analysis/Analyzer.html
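To make these terms a little more concrete, here is a toy sketch in plain Java of how an index, a document and an analyzer relate to each other. It does not use Lucene at all and is far simpler than what Lucene actually does; TinyInvertedIndex and its methods are names I made up purely for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy illustration only: none of these names are Lucene classes.
final class TinyInvertedIndex {
    // The "index": each term maps to the ids of the documents that contain it.
    private final Map<String, Set<Integer>> postings = new TreeMap<>();
    // The "documents": here just the raw text, addressed by position.
    private final List<String> documents = new ArrayList<>();

    // Plays the role of an Analyzer: lower-case the text and split it into tokens.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Plays the role of adding a document through a writer; returns the new document id.
    int add(String text) {
        int docId = documents.size();
        documents.add(text);
        for (String token : analyze(text)) {
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    // Plays the role of a single-term search: ids of the documents containing the word.
    Set<Integer> search(String word) {
        List<String> tokens = analyze(word);
        if (tokens.isEmpty()) {
            return Collections.<Integer>emptySet();
        }
        Set<Integer> hits = postings.get(tokens.get(0));
        return hits == null ? Collections.<Integer>emptySet() : Collections.unmodifiableSet(hits);
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add("Apple pie recipe");
        index.add("Banana bread");
        index.add("Green apple, red apple");
        System.out.println(index.search("APPLE")); // prints [0, 2]
    }
}
```

The point to take away is that the query goes through the same analysis as the indexed text, which is why "APPLE" finds documents containing "apple" and "Apple".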
Diagram
Simple Search example
The scope of this sample project is to demonstrate the basic search capabilities of Apache Lucene. It barely scratches the surface, and I will use future posts to demonstrate some of the more complex and more interesting concepts this library offers.
I have also included a sample word list, which we shall use to perform our searches. This can be downloaded here
1. Create a basic Maven project and name it 'Lucene'. Once this step is completed, the new project appears in your IDE.
2. The next step is to add the Lucene Maven dependencies to your pom file. The classes used below come from the lucene-core, lucene-analyzers-common and lucene-queryparser artifacts (group org.apache.lucene), and the code targets Version.LUCENE_47, so a 4.7.x release should match.
3. The next thing we need is a sample dictionary file, which can be downloaded here.
Unzip this file and place the folder in ${project_home}/src/main/resources
4. Now onto the actual Java sources. The first step is to create an index of the dictionary files using an IndexWriter. To do this, we create a class which accepts the data directory on the file system and reads it recursively for any files which have an extension of '.txt'.
package com.gb.search.lucene;

import java.io.File;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class SimpleFileIndexer {

    public int index(File indexDir, File dataDir, String suffix) throws IOException {
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_47, new StandardAnalyzer(Version.LUCENE_47));
        IndexWriter indexWriter = new IndexWriter(FSDirectory.open(indexDir), config);
        int fileCtr = 0;
        try {
            // clear out all documents from the index
            indexWriter.deleteAll();
            // perform indexing
            indexDirectory(indexWriter, dataDir, suffix);
            // commit your changes here
            indexWriter.commit();
            // read the count while the writer is still open
            fileCtr = indexWriter.numDocs();
        } catch (IOException | RuntimeException e) {
            // in case of an exception, roll back any changes (this also closes the writer)
            indexWriter.rollback();
            // rethrow so the caller knows that indexing failed
            throw e;
        } finally {
            // NOTE - don't forget to close the index writer
            indexWriter.close();
        }
        return fileCtr;
    }

    /**
     * Recursively index files in a directory.
     * @param indexWriter writer for the target index
     * @param dataDir directory to scan
     * @param suffix file name suffix to accept, or null to accept every file
     * @throws IOException if a file cannot be read or indexed
     */
    private void indexDirectory(IndexWriter indexWriter, File dataDir,
            String suffix) throws IOException {
        File[] files = dataDir.listFiles();
        if (files == null) {
            // not a directory, or it could not be read
            return;
        }
        for (File file : files) {
            if (file.isDirectory()) {
                indexDirectory(indexWriter, file, suffix);
            } else {
                indexFileWithIndexWriter(indexWriter, file, suffix);
            }
        }
    }

    // The Field constructors used below are deprecated in Lucene 4.x
    @SuppressWarnings("deprecation")
    private void indexFileWithIndexWriter(IndexWriter indexWriter, File file,
            String suffix) throws IOException {
        // skip anything that is not a readable, visible, regular file
        if (file.isHidden() || file.isDirectory() || !file.canRead() || !file.exists()) {
            return;
        }
        // skip files that do not match the requested suffix
        if (suffix != null && !file.getName().endsWith(suffix)) {
            return;
        }
        System.out.println("Indexing file : " + file.getCanonicalPath());
        Document document = new Document();
        // "contents" is tokenized and indexed from the reader, but not stored
        document.add(new Field("contents", new FileReader(file)));
        // "fileName" is stored so the searcher can find the file again
        document.add(new Field("fileName", file.getCanonicalPath(), Field.Store.YES, Field.Index.ANALYZED));
        indexWriter.addDocument(document);
    }
}
The above code does a few things. As a first step, it declares a config object which also holds a reference to an Analyzer. An index directory is then opened based on the directory parameter. The config and the index directory are used to create an IndexWriter.
We start off by clearing all documents held in the index and then add files recursively from the directory specified. Once this is done, we commit the changes and close the writer in a finally block.
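If you want to check which files the recursive walk would pick up before involving Lucene at all, that part can be sketched on its own. This is a minimal sketch that assumes Java 8 or later and uses java.nio instead of the File API from the indexer. SuffixFileCollector is a hypothetical helper and not part of the sample project, and unlike the indexer it does not skip hidden or unreadable files.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of the sample project.
final class SuffixFileCollector {

    // Walks root recursively and returns the regular files whose name ends with suffix.
    // A null suffix accepts every regular file, mirroring the indexer's behaviour.
    static List<Path> collect(Path root, String suffix) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> suffix == null || p.getFileName().toString().endsWith(suffix))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the current directory when no argument is given.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path file : collect(root, ".txt")) {
            System.out.println(file);
        }
    }
}
```

Note that the suffix is matched with endsWith, so passing "txt" (as the tester in this post does) would also accept a file named "notestxt". Passing ".txt" is the stricter choice.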
package com.gb.search.lucene;

import java.io.File;
import java.io.IOException;
import java.net.URL;

public class IndexWriteTester {

    public static void main(String[] args) throws IOException {
        URL dictUrl = IndexWriteTester.class.getResource("/12dicts-5.0");
        File indexDir = new File("lucene-index");
        System.out.println(indexDir.getAbsolutePath());
        File dataDir = new File(dictUrl.getFile());
        String suffix = "txt";
        SimpleFileIndexer indexer = new SimpleFileIndexer();
        int numIndex = indexer.index(indexDir, dataDir, suffix);
        System.out.println("Indexed " + numIndex + " files");
    }
}
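Running IndexWriteTester prints each file as it is indexed, followed by a count of the indexed files, on the Eclipse console.
With the index in place, we can search it. SimpleFileSearcher opens the index through an IndexReader, parses the query string against the 'contents' field, and then scans each matching file line by line to show where the word occurs.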
package com.gb.search.lucene;

import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.LineNumberReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class SimpleFileSearcher {

    public void searchIndex(File indexDir, String queryStr, int maxHits) throws IOException, ParseException {
        // try-with-resources closes the reader and the directory even if the search fails
        try (Directory directory = FSDirectory.open(indexDir);
                IndexReader indexReader = DirectoryReader.open(directory)) {
            IndexSearcher indexSearcher = new IndexSearcher(indexReader);
            // use the same analyzer that was used at index time
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_47);
            QueryParser queryParser = new QueryParser(Version.LUCENE_47, "contents", analyzer);
            Query query = queryParser.parse(queryStr);
            TopDocs topDocs = indexSearcher.search(query, maxHits);
            ScoreDoc[] hits = topDocs.scoreDocs;
            int totalHits = topDocs.totalHits;
            for (ScoreDoc doc : hits) {
                int docId = doc.doc;
                Document document = indexSearcher.doc(docId);
                String fileName = document.get("fileName");
                File file = new File(fileName);
                System.out.println(fileName);
                // re-read the matching file to report the individual lines; one reader per file, closed after use
                try (LineNumberReader lineNumberReader = new LineNumberReader(new FileReader(file))) {
                    String currentLine;
                    // read inside the loop condition only, so the first line is not skipped
                    while ((currentLine = lineNumberReader.readLine()) != null) {
                        // note: this comparison is case-sensitive, unlike the analyzed index
                        if (currentLine.contains(queryStr)) {
                            System.out.println("\tMatch " + queryStr + " found in line " + lineNumberReader.getLineNumber() + " in file " + fileName + " which looks like " + currentLine);
                        }
                    }
                }
            }
            System.out.println("Found " + totalHits);
        }
    }
}
package com.gb.search.lucene;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.queryparser.classic.ParseException;

public class IndexReadTester {

    public static void main(String[] args) throws IOException, ParseException {
        File indexDir = new File("lucene-index");
        String queryStr = "apple";
        int maxHits = 100;
        SimpleFileSearcher simpleSearcher = new SimpleFileSearcher();
        simpleSearcher.searchIndex(indexDir, queryStr, maxHits);
    }
}
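The line-by-line reporting step in SimpleFileSearcher is independent of Lucene, so it can be pulled out and tried separately. Here is a minimal sketch of that step. LineMatcher is a hypothetical helper and not part of the sample project, and unlike the searcher it compares in lower case, which I am assuming is closer to how StandardAnalyzer treats the indexed text.

```java
import java.io.IOException;
import java.io.LineNumberReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the sample project.
final class LineMatcher {

    // Returns the 1-based numbers of the lines that contain the word, ignoring case.
    // The reader passed in is closed when the method returns.
    static List<Integer> matchingLines(Reader source, String word) throws IOException {
        String needle = word.toLowerCase(Locale.ROOT);
        List<Integer> lineNumbers = new ArrayList<>();
        try (LineNumberReader reader = new LineNumberReader(source)) {
            String line;
            // Reading only in the loop condition makes sure the first line is checked too.
            while ((line = reader.readLine()) != null) {
                if (line.toLowerCase(Locale.ROOT).contains(needle)) {
                    // getLineNumber() reports the number of the line just read.
                    lineNumbers.add(reader.getLineNumber());
                }
            }
        }
        return lineNumbers;
    }

    public static void main(String[] args) throws IOException {
        String text = "Apple\nbanana\npineapple\n";
        System.out.println(matchingLines(new StringReader(text), "apple")); // prints [1, 3]
    }
}
```

Because this is a plain substring check, "pineapple" matches a search for "apple" here, whereas the Lucene query only matches whole terms. The two results can therefore differ for the same word.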
Conclusion