Snowball Analyzer for Sitecore

During an assessment of a customer’s Sitecore implementation, I reacquainted myself with a familiar area of the Sitecore.ContentSearch.LuceneProvider assembly. On some earlier projects, a year or more ago, I spent *days* digging through the ContentSearch library while working on some very specific tuning scenarios. This past week, I had a chance to revisit my notes from those efforts and formulate a recommendation for a customer looking to improve their Sitecore search performance.

In this particular case, the customer was frustrated by the Content Management side of Sitecore and some of the peculiarities of its search. Besides sharing the Sitecore Content Author’s Cookbook and encouraging them to thoroughly review the sections on special characters in search queries, we also helped them evaluate alternative Analyzers for Sitecore’s Lucene provider.

SnowballAnalyzer (Lucene.Net.Analysis.Snowball.SnowballAnalyzer)

The SnowballAnalyzer ships with Sitecore out of the box as part of the Lucene.Net.Contrib.Snowball assembly. With it, if one searches for “apples” and the text contains “apple” (but not the plural), the “apple” result will still be returned. Internally, the SnowballAnalyzer reduces both apples and apple to the stem appl. Likewise, if an item’s text contains the word “weakness”, a search for “weak” or “weakly” will match that item. Try it out for yourself at http://snowballstem.org/demo.html: enter raining, rained, and rain and see how they all reduce to the same stem, courtesy of the Snowball stemming logic.

To put the SnowballAnalyzer to work in your Sitecore implementation, there are a couple of approaches. In this case, I created a patch config file to switch the sitecore_master_index over to Snowball. The key element is the analyzer definition for the specific index. I would NOT suggest editing the defaultLuceneIndexConfiguration directly; I created my own contentSearch/indexConfigurations/snowballLuceneIndexConfiguration to apply this more surgically. For reference, this links to my full .config patch for the Sitecore 8.1 rev 151207 build; it’s a .doc since WordPress won’t allow a real .config.

Here is just the <analyzer> piece:

<analyzer type="Sitecore.ContentSearch.LuceneProvider.Analyzers.PerExecutionContextAnalyzer, Sitecore.ContentSearch.LuceneProvider">
    <param desc="defaultAnalyzer" type="Sitecore.ContentSearch.LuceneProvider.Analyzers.DefaultPerFieldAnalyzer, Sitecore.ContentSearch.LuceneProvider">
        <param desc="defaultAnalyzer" type="Lucene.Net.Analysis.Snowball.SnowballAnalyzer, Lucene.Net.Contrib.Snowball">
            <param hint="version">Lucene_30</param>
            <param hint="name">English</param>
        </param>
    </param>
    ...
</analyzer>
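
The <analyzer> element above sits inside the custom snowballLuceneIndexConfiguration node. As a rough sketch of that placement only: the skeleton below assumes the custom node is a full copy of defaultLuceneIndexConfiguration from your own Sitecore version, renamed, with just the analyzer swapped. The type attribute on the custom node is my assumption of what the default node declares, so verify it against your Sitecore.ContentSearch.Lucene.DefaultIndexConfiguration.config, and treat the comments as placeholders for content you would copy, not as complete configuration.

```xml
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <contentSearch>
      <indexConfigurations>
        <!-- Custom configuration node: a renamed copy of defaultLuceneIndexConfiguration.
             The type value is assumed to match the default node; confirm it in your install. -->
        <snowballLuceneIndexConfiguration type="Sitecore.ContentSearch.LuceneProvider.LuceneIndexConfiguration, Sitecore.ContentSearch.LuceneProvider">
          <!-- Placeholder: all other child elements copied unchanged from defaultLuceneIndexConfiguration -->
          <analyzer type="Sitecore.ContentSearch.LuceneProvider.Analyzers.PerExecutionContextAnalyzer, Sitecore.ContentSearch.LuceneProvider">
            <!-- Placeholder: the SnowballAnalyzer param tree from the <analyzer> piece above -->
          </analyzer>
        </snowballLuceneIndexConfiguration>
      </indexConfigurations>
    </contentSearch>
  </sitecore>
</configuration>
```

Keeping the Snowball settings in their own node means the default configuration, and every index that still references it, stays untouched.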

This is what the index patch looks like; it points sitecore_master_index at the new configuration:

<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <contentSearch>
      <configuration>
        <indexes>
          <index id="sitecore_master_index" type="Sitecore.ContentSearch.LuceneProvider.LuceneIndex, Sitecore.ContentSearch.LuceneProvider">
            <configuration>
              <!-- Point this index at the custom Snowball configuration instead of the default -->
              <patch:attribute name="ref">contentSearch/indexConfigurations/snowballLuceneIndexConfiguration</patch:attribute>
            </configuration>
          </index>
        </indexes>
      </configuration>
    </contentSearch>
  </sitecore>
</configuration>

I have a lot more to say on this topic, but for now I’ll leave it here; this is enough for anyone to experiment with the SnowballAnalyzer in Sitecore.
