Lucene.Net Indexing and Searching: An Entry-Level Tutorial

Many .Net developers have been using Lucene.net since it first appeared on the scene, after being ported directly from the Java Lucene offering around 2004. This is an entry-level tutorial to get you up to speed quickly.

If you visit Codeplex.com you can find a number of implementations of indexing and search engines built with Lucene.net, some of them quite sophisticated. However, getting any one of them into a usable state for your own content can be difficult because they are complex.


So while investigating Lucene.net for my own use as a potential alternative to, say, SQL Server Full-Text Search or dtSearch, I thought it would be a good idea to start from the ground up, with just the basics.


Lucene.net is a complete, single-assembly solution to virtually any indexing and searching need; the compiled assembly is only about 440 KB. Lucene.net is also fast: you'll see an elapsed time in milliseconds displayed in the demo. And there's nothing to stop you from kicking off multiple threads to do your indexing, provided you have sufficient locking semantics around the indexer's write calls. It is 100% managed code; there is no COM Interop and no dependency on a database.
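As a minimal sketch of that locking idea, here is one way you might serialize writes from multiple threads. The wrapper class and member names below are my own illustration, not anything provided by Lucene.net:

```csharp
using Lucene.Net.Documents;
using Lucene.Net.Index;

// Hypothetical wrapper that serializes document writes from multiple
// indexing threads. The IndexWriter is the shared resource being protected.
public class SafeIndexer
{
    private readonly object _writeLock = new object();
    private readonly IndexWriter _writer;

    public SafeIndexer(IndexWriter writer)
    {
        _writer = writer;
    }

    public void Write(Document doc)
    {
        // Only one thread at a time may add a document.
        lock (_writeLock)
        {
            _writer.AddDocument(doc);
        }
    }
}
```

Each worker thread would build its own Document and call Write on a shared SafeIndexer instance.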


Lucene.Net consists of the following basic building blocks:


Directory – a Lucene-specific index directory on disk (there are other options, such as an in-RAM directory).
Analyzer – any of a number of different content analyzers. Here I use the StandardAnalyzer.
IndexWriter – does just what it says: writes index entries into the Directory.
IndexReader – reads an existing index and exposes many methods, such as document counts.
Document – the basic structure used to define the fields in your index items.
IndexSearcher – executes searches against an index.
QueryParser – parses a specified query string so that the searcher can use it.
Query – the object that represents the query for the searcher to execute.

That's really all you need to know to understand Lucene.net. There is more, but by starting out with the basics, it becomes much easier to grasp it all.

Indexing:

Here is how we would create an index from within a web page:

protected void btnIndex_Click(object sender, EventArgs e)
{
    // Set up the indexer
    Directory directory = FSDirectory.GetDirectory(Server.MapPath("~/LuceneIndex"));
    Analyzer analyzer = new StandardAnalyzer();

    // The third argument to the IndexWriter constructor is a boolean that
    // tells it to create the index if it doesn't already exist (when true).
    IndexWriter writer = new IndexWriter(directory, analyzer, true);
    IndexReader red = IndexReader.Open(directory);
    int totDocs = red.MaxDoc();
    red.Close();

    // Add documents to the index
    string txts = totDocs.ToString();
    int j = 0;

    WebClient wc = new WebClient();
    string s = wc.DownloadString("http://feeds.feedburner.com/eggheadcafe/gWWU");
    MemoryStream ms = new MemoryStream(System.Text.Encoding.UTF8.GetBytes(s));
    XmlDocument doc = new XmlDocument();
    doc.Load(ms);
    Stopwatch sw = new Stopwatch();
    sw.Start();
    XmlNodeList nodes = doc.DocumentElement.SelectNodes("//item");
    for (int i = 0; i < nodes.Count; i++)
    {
        AddTextToIndex(i, nodes[i], writer);
        j++;
    }
    // Optimize our index
    writer.Optimize();
    // Close everybody
    writer.Flush();
    writer.Close();
    directory.Close();
    sw.Stop();
    Label1.Text = j.ToString() + " entries added, " + txts + " documents total in " +
        sw.ElapsedMilliseconds.ToString() + "ms";
}

First, we get our Directory. We create an Analyzer, an IndexWriter, and an IndexReader. Here, we're only using the IndexReader to get the total count of existing documents.

Normally, I would read the documents to be indexed out of a database. But to make the demo more portable, I'm just getting one of our Eggheadcafe.com article feeds (the list of articles you see at the left of our home page) and indexing that. The key method is AddTextToIndex:

private void AddTextToIndex(int txts, XmlNode node, IndexWriter writer)
{
    Document doc = new Document();
    doc.Add(new Field("id", txts.ToString(), Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.Add(new Field("Description", node.ChildNodes[3].InnerText, Field.Store.YES, Field.Index.TOKENIZED));
    doc.Add(new Field("Link", node.ChildNodes[2].InnerText, Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.Add(new Field("Title", node.ChildNodes[0].InnerText, Field.Store.YES, Field.Index.TOKENIZED));
    writer.AddDocument(doc);
}

Your custom version of an AddTextToIndex method might have a DataRow parameter, for example, instead of an XmlNode.
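As a sketch of that DataRow variant: the column names ("Id", "Description", "Link", "Title") below are assumptions about your own schema, not anything Lucene.net requires.

```csharp
using System.Data;
using Lucene.Net.Documents;
using Lucene.Net.Index;

// Hypothetical variant of AddTextToIndex that indexes a DataRow
// (for example, one row per article pulled from a database table).
private void AddRowToIndex(DataRow row, IndexWriter writer)
{
    Document doc = new Document();
    doc.Add(new Field("id", row["Id"].ToString(), Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.Add(new Field("Description", row["Description"].ToString(), Field.Store.YES, Field.Index.TOKENIZED));
    doc.Add(new Field("Link", row["Link"].ToString(), Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.Add(new Field("Title", row["Title"].ToString(), Field.Store.YES, Field.Index.TOKENIZED));
    writer.AddDocument(doc);
}
```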

We create a new Document, and then we add fields. The "id" field is a must-have. Then I add Description, Link, and Title fields, just as you would find in an RSS item node. The Field.Store.YES value tells Lucene to store the field's contents in the index on disk so it comes back with the results, and TOKENIZED means I want Lucene.net to pull the text apart and analyze it so it can be used in a query. UN_TOKENIZED (used here for Link) means I need the value stored so it comes back in my results, but it is not analyzed for full-text searching.

So essentially here I am indexing both the Description and the Title fields. You can get the documents to be indexed from anywhere you want: the file system, a database, or, as here, an RSS feed.
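For instance, a file-system source might look like the following sketch. The folder path, the *.txt filter, and the choice of fields are illustrative assumptions; note that System.IO.Directory is fully qualified to avoid a clash with Lucene's own Directory class.

```csharp
using System.IO;
using Lucene.Net.Documents;
using Lucene.Net.Index;

// Sketch: indexing plain-text files from a folder instead of an RSS feed.
private void IndexTextFiles(string folder, IndexWriter writer)
{
    foreach (string path in System.IO.Directory.GetFiles(folder, "*.txt"))
    {
        Document doc = new Document();
        // Use the file path as the identifier, stored but not tokenized.
        doc.Add(new Field("id", path, Field.Store.YES, Field.Index.UN_TOKENIZED));
        // Index the file contents and the file name for full-text search.
        doc.Add(new Field("Description", File.ReadAllText(path), Field.Store.YES, Field.Index.TOKENIZED));
        doc.Add(new Field("Title", Path.GetFileName(path), Field.Store.YES, Field.Index.TOKENIZED));
        writer.AddDocument(doc);
    }
}
```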

Searches

Searches are performed with the IndexSearcher and QueryParser objects:

protected void btnSearch_Click(object sender, EventArgs e)
{
    // directory and analyzer are assumed to be class-level fields,
    // initialized the same way as in btnIndex_Click.
    IndexSearcher searcher = new IndexSearcher(directory);
    QueryParser parser = new QueryParser("Description", analyzer);
    // Supply conditions
    Search(txtSearch.Text, searcher, parser);
}

private void Search(string text, IndexSearcher searcher, QueryParser parser)
{
    if (text == null) text = ".NET";
    // Parse the search text into a Query
    Query query = parser.Parse(text);
    // Do the search
    Hits hits = searcher.Search(query);
    int results = hits.Length();
    Label1.Text = "Found " + results.ToString() + " results";
    List<SearchResult> list = new List<SearchResult>();
    SearchResult sr = null;
    for (int i = 0; i < results; i++)
    {
        sr = new SearchResult();
        Document doc = hits.Doc(i);
        float score = hits.Score(i);
        sr.Id = int.Parse(doc.Get("id"));
        sr.Score = score;
        sr.Description = doc.Get("Description");
        sr.Title = doc.Get("Title");
        sr.Link = doc.Get("Link");
        list.Add(sr);
    }
    // Sort by score (Hits come back in relevance order already;
    // this just makes the ordering explicit for databinding)
    list = list.OrderByDescending(x => x.Score).ToList();
    DataList1.DataSource = list;
    DataList1.DataBind();
}

We get a Query by having the parser parse the search text. The searcher's Search method returns the hits. Here I am using a simple SearchResult class for databinding so I can assemble all the fields I want to display to the user:

public class SearchResult
{
    public float Score { get; set; }
    public int Id { get; set; }
    public string Link { get; set; }
    public string Description { get; set; }
    public string Title { get; set; }
}
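The string you pass to the parser can use standard Lucene query syntax, not just bare words. A few illustrative examples (these query strings are my own; the default field is "Description", as set up in btnSearch_Click):

```csharp
// Standard Lucene query syntax examples, parsed against the "Description" default field.
Query q1 = parser.Parse("asp.net");                   // single term
Query q2 = parser.Parse("\"managed code\"");          // exact phrase
Query q3 = parser.Parse("Title:Lucene AND indexing"); // field-scoped term combined with a default-field term
Query q4 = parser.Parse("index*");                    // wildcard
```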

That's all there is to it. Pretty simple, really. Of course, there are a lot of other classes and extensions, such as highlighters and document-summary creators, that you can build out your creations with. For example, to generate highlighted hit text using the contrib Highlighter assembly, you could do this:

// In the Search method:
sr.Description = doc.Get("Description");
// Replace the body text with a highlighted preview:
string preview = GeneratePreviewText(query, sr.Description);
sr.Description = preview;

public string GeneratePreviewText(Query q, string text)
{
    QueryScorer scorer = new QueryScorer(q);
    Lucene.Net.Highlight.Formatter formatter =
        new Lucene.Net.Highlight.SimpleHTMLFormatter("<span style='background:yellow;'>", "</span>");
    Highlighter highlighter = new Highlighter(formatter, scorer);
    highlighter.SetTextFragmenter(new SimpleFragmenter(250));
    TokenStream stream = new StandardAnalyzer().TokenStream("Description", new StringReader(text));
    return highlighter.GetBestFragments(stream, text, 4, "<br/>");
}

You can download the working Visual Studio 2010 Solution here. It includes a release build of the latest Lucene.Net assembly along with the Highlighter assembly and the Highlight code shown above.

By Peter Bromberg