Monday, May 27, 2013

How to Test RavenDB Indexes

What if you could spin up an entire database in memory for every unit test? You can!

RavenDB offers an EmbeddableDocumentStore, available as a NuGet package, that allows you to create a complete in-memory instance of RavenDB. This makes writing integration tests for your custom indexes extremely easy.

The Hibernating Rhinos team makes great use of this feature by including a full suite of unit tests in their RavenDB solution. They even encourage people to submit failing tests as pull requests to their GitHub repository so those tests can be pulled directly into the source. This is a BRILLIANT integration of all these technologies, both to encourage testing and to provide an extremely stable product.

So then, how do you test your RavenDB indexes? Great question; let's get into the code!

  1. Define your Document
public class Doc
{
    public int InternalId { get; set; }
    public string Text { get; set; }
}
  2. Define your Index
public class Index : AbstractIndexCreationTask<Doc>
{
    public Index()
    {
        Map = docs => from doc in docs
                      select new
                      {
                          doc.InternalId,
                          doc.Text
                      };
        Analyzers.Add(d => d.Text, "Raven.Extensions.AlphanumericAnalyzer");
    }
}
  3. Create your EmbeddableDocumentStore
  4. Insert your Index and Documents

In this example I am creating a generic abstract base class for my unit tests. The NewDocumentStore method provides an EmbeddableDocumentStore that comes pre-initialized with the default RavenDocumentsByEntityName index, your custom index, and a complete set of documents that have already been inserted. The documents come from an abstract Documents property, which we will see implemented below in step 5.

protected abstract ICollection<TDoc> Documents { get; }
 
protected EmbeddableDocumentStore NewDocumentStore()
{
    var documentStore = new EmbeddableDocumentStore
    {
        Configuration =
        {
            RunInUnreliableYetFastModeThatIsNotSuitableForProduction = true,
            RunInMemory = true
        }
    };
 
    documentStore.Initialize();
 
    // Create Default Index
    var defaultIndex = new RavenDocumentsByEntityName();
    defaultIndex.Execute(documentStore);
 
    // Create Custom Index
    var customIndex = new TIndex();
    customIndex.Execute(documentStore);
 
    // Insert Documents from Abstract Property
    using (var bulkInsert = documentStore.BulkInsert())
        foreach (var document in Documents)
            bulkInsert.Store(document);
 
    return documentStore;
}
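
For reference, these members live in a generic abstract base class that ties the document and index types together. A minimal sketch of that declaration (the class name IndexTests is my placeholder, not from the original code) looks like this:

public abstract class IndexTests<TDoc, TIndex>
    where TIndex : AbstractIndexCreationTask, new()
{
    // The abstract Documents property and the NewDocumentStore
    // method shown above are members of this class.
}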
  5. Write your Tests

These tests exercise a custom Alphanumeric analyzer. They take in a series of Lucene queries and assert that each query matches the correct internal IDs. The documents are defined by our abstract Documents property from step 4.

NOTE: Do not forget to include the WaitForNonStaleResults method on your queries, as your index may not be done building the first time you run your tests.

[Theory]
[InlineData(@"Text:Hello",              new[] {0})]
[InlineData(@"Text:file_name",          new[] {2})]
[InlineData(@"Text:name*",              new[] {2, 3})]
[InlineData(@"Text:my AND Text:txt",    new[] {2, 3})]
public void Query(string query, int[] expectedIds)
{
    int[] actualIds;
 
    using (var documentStore = NewDocumentStore())
    using (var session = documentStore.OpenSession())
    {
        actualIds = session.Advanced
            .LuceneQuery<Doc>("Index")
            .Where(query)
            .SelectFields<int>("InternalId")
            .WaitForNonStaleResults()
            .ToArray();
    }
 
    Assert.Equal(expectedIds, actualIds);
}
 
protected override ICollection<Doc> Documents
{
    get
    {
        return new[]
            {
                "Hello, world!",
                "Goodnight...moon?",
                "my_file_name_01.txt",
                "my_file_name01.txt"
            }
            .Select((t, i) => new Doc
            {
                InternalId = i,
                Text = t
            })
            .ToArray();
    }
}

Enjoy,
Tom

Sunday, May 12, 2013

Alphanumeric Lucene Analyzer for RavenDB

RavenDB's full-text indexing uses Lucene.Net

RavenDB is a second generation document database. This means you can throw typeless documents into a data store, but the only way to query them is through indexes built with Lucene.Net. RavenDB is a wonderful product whose primary strength is its simplicity and ease of use. In keeping with that theme, even when you need to customize RavenDB, it makes doing so relatively easy.

So, let's talk about customizing your Lucene.Net analyzer in RavenDB!

Available Analyzers

RavenDB comes equipped with all of the analyzers that are built into Lucene.Net. For the vast majority of use cases, these will do the job! Here are some examples:

  • "The fox jumped over the lazy dogs, Bob@hotmail.com 123432."
  • StandardAnalyzer, which is Lucene's default, will produce the following tokens:
    [fox] [jumped] [over] [lazy] [dog] [bob@hotmail.com] [123432]
  • SimpleAnalyzer will tokenize on all non-alpha characters, and will make all the tokens lowercase:
    [the] [fox] [jumped] [over] [the] [lazy] [dogs] [bob] [hotmail] [com]
  • WhitespaceAnalyzer will just tokenize on white spaces:
    [The] [fox] [jumped] [over] [the] [lazy] [dogs,] [Bob@hotmail.com]
    [123432.]

In order to resolve an issue with indexing file names (details below), I found myself in need of an Alphanumeric analyzer. This analyzer would be similar to the SimpleAnalyzer, but would still respect numeric values.

  • AlphanumericAnalyzer will tokenize on the .NET framework's Char.IsLetterOrDigit, then remove stop words and lowercase the tokens:
    [fox] [jumped] [over] [lazy] [dogs] [bob] [hotmail] [com] [123432]
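
You can reproduce these token lists yourself with just a few lines of Lucene.Net. Here is a quick sketch, assuming Lucene.Net 3.0.x APIs:

// using Lucene.Net.Analysis;
// using Lucene.Net.Analysis.Tokenattributes;
var analyzer = new SimpleAnalyzer();
var stream = analyzer.TokenStream(
    "Text",
    new StringReader("The fox jumped over the lazy dogs, Bob@hotmail.com 123432."));
var term = stream.AddAttribute<ITermAttribute>();
while (stream.IncrementToken())
    Console.Write("[{0}] ", term.Term);

Swap in another analyzer (StandardAnalyzer requires a Version argument) to compare the outputs.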

Lucene.Net's base classes made this pretty easy to build...

How to Implement a Custom Analyzer

Grab all the code and more from GitHub:

Raven.Extensions.AlphanumericAnalyzer on GitHub

A Lucene analyzer is made of two basic parts: 1) a tokenizer, and 2) a series of filters. The tokenizer does the lion's share of the work by splitting the input apart; the filters then run in succession, making additional tweaks to the tokenized output.

To create the Alphanumeric analyzer we need only create two classes: an analyzer and a tokenizer. After that, the analyzer can reuse the existing LowerCaseFilter and StopFilter classes.

AlphanumericAnalyzer

public sealed class AlphanumericAnalyzer : Analyzer
{
    private readonly bool _enableStopPositionIncrements;
    private readonly ISet<string> _stopSet;

    public AlphanumericAnalyzer(Version matchVersion, ISet<string> stopWords)
    {
        _enableStopPositionIncrements = StopFilter
            .GetEnablePositionIncrementsVersionDefault(matchVersion);
        _stopSet = stopWords;
    }
 
    public override TokenStream TokenStream(String fieldName, TextReader reader)
    {
        TokenStream tokenStream = new AlphanumericTokenizer(reader);
        tokenStream = new LowerCaseFilter(tokenStream);
        tokenStream = new StopFilter(
            _enableStopPositionIncrements, 
            tokenStream, 
            _stopSet);
 
        return tokenStream;
    }
}

AlphanumericTokenizer

public class AlphanumericTokenizer : CharTokenizer
{
    public AlphanumericTokenizer(TextReader input)
        : base(input)
    {
    }

    protected override bool IsTokenChar(char c)
    {
        return Char.IsLetterOrDigit(c);
    }
}
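
With both classes in place, you can take the analyzer for a spin outside of RavenDB using the same pattern as the sketch above (here reusing Lucene's built-in English stop word set):

var analyzer = new AlphanumericAnalyzer(
    Version.LUCENE_30,
    StopAnalyzer.ENGLISH_STOP_WORDS_SET);
var stream = analyzer.TokenStream(
    "Text",
    new StringReader("my_file_name_01.txt"));
var term = stream.AddAttribute<ITermAttribute>();
while (stream.IncrementToken())
    Console.Write("[{0}] ", term.Term);   // [my] [file] [name] [01] [txt]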

How to Install Plugins in RavenDB

Installing a custom plugin into RavenDB is unbelievably easy. Just compile your assembly, and then drop it into the Plugins folder at the root of your RavenDB server. You may then reference the analyzers in your indexes by their assembly qualified type names.
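
For example, an index could reference the analyzer like this; treat the exact namespace and assembly strings as placeholders that must match your compiled DLL:

Analyzers.Add(
    d => d.Text,
    "Raven.Extensions.AlphanumericAnalyzer.AlphanumericAnalyzer, Raven.Extensions.AlphanumericAnalyzer");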

Again, you can grab all of the code and more over on GitHub:

Raven.Extensions.AlphanumericAnalyzer on GitHub


Enjoy,
Tom
