Snowball Analyzer for Sitecore

During an assessment of a customer’s Sitecore implementation, I reacquainted myself with a familiar area: the Sitecore.ContentSearch.LuceneProvider assembly. On some earlier projects, a year or more ago, I spent *days* digging through the ContentSearch library to work on some very specific tuning scenarios. This past week, I had a chance to revisit my notes from those efforts and formulate a recommendation for a customer looking to improve their Sitecore search performance.

In this particular case, the customer was frustrated by the Content Management side of Sitecore and some of the peculiarities of search. Besides sharing the Sitecore Content Author’s Cookbook and encouraging them to thoroughly review the sections about special characters in search queries, we also helped them evaluate alternative Analyzers for Sitecore’s Lucene provider.

SnowballAnalyzer (Lucene.Net.Analysis.Snowball.SnowballAnalyzer)

The SnowballAnalyzer ships with Sitecore out of the box as part of the Lucene.Net.Contrib.Snowball assembly. With it, if one searches for “apples” and the text contains “apple” but not the plural, the “apple” result will still be returned. Internally, the SnowballAnalyzer reduces both apples and apple to the stem appl. Likewise, if the text contains the word “weakness” and a search for “weak” or “weakly” is executed, the item will match. Try it out for yourself at http://snowballstem.org/demo.html: enter raining, rained, and rain and see how they all reduce to the same stem courtesy of the Snowball logic.

To put the SnowballAnalyzer to work for your Sitecore implementation, there are a couple of approaches. In this case, I created a patch config file to switch the sitecore_master_index to Snowball. The key element is the analyzer definition for the specific index. I would NOT suggest editing the defaultLuceneIndexConfiguration directly; instead, I created my own contentSearch/indexConfigurations/snowballLuceneIndexConfiguration node to apply this more surgically. For reference, this links to my full .config patch for the Sitecore 8.1 rev 151207 build; it’s a .doc since WordPress won’t allow a real .config.

Here is just the <analyzer> piece:

<analyzer type="Sitecore.ContentSearch.LuceneProvider.Analyzers.PerExecutionContextAnalyzer, Sitecore.ContentSearch.LuceneProvider">
    <param desc="defaultAnalyzer" type="Sitecore.ContentSearch.LuceneProvider.Analyzers.DefaultPerFieldAnalyzer, Sitecore.ContentSearch.LuceneProvider">
        <param desc="defaultAnalyzer" type="Lucene.Net.Analysis.Snowball.SnowballAnalyzer, Lucene.Net.Contrib.Snowball">
            <param hint="version">Lucene_30</param>
            <param hint="name">English</param>
        </param>
    </param>
    ...
</analyzer>

This is what the index configuration patch looks like.

<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <contentSearch>
      <configuration>
        <indexes>
          <index id="sitecore_master_index" type="Sitecore.ContentSearch.LuceneProvider.LuceneIndex, Sitecore.ContentSearch.LuceneProvider">
            <!-- Point the master index at the custom Snowball index configuration -->
            <configuration>
              <patch:attribute name="ref">contentSearch/indexConfigurations/snowballLuceneIndexConfiguration</patch:attribute>
            </configuration>
          </index>
        </indexes>
      </configuration>
    </contentSearch>
  </sitecore>
</configuration>

I have a lot more to say on this topic, but for now I’ll let this alone as this is sufficient for anyone to experiment with the SnowballAnalyzer for Sitecore.

How Sitecore Phones Home (maybe)

A question came up around the office today: does Sitecore monitor its installation base via a phone-home mechanism of some kind? It’s relevant in a number of ways. For one, as we work on cloud installations and elastic models for Sitecore, what data might Sitecore have that we could leverage for tracking licensing compliance, utilization, etc.?

There may be other ways Sitecore tracks their server activities, but I used Reflector on the Sitecore.sitecore.login namespace to find one very likely place where Sitecore implements a call-back to the mothership.  Here goes . . .

Using a clean Sitecore 8.1 rev 151207 installation, in the source for the sitecore/login/default.aspx page is the following mark-up.  Note the “display:none;” on line 1 that hides this div from view:

   1:      <div id="licenseOptions" style="display: none;">
   2:  <%--            <h2 class="form-signin-heading">License and browser information</h2>--%>
   3:              <div class="license-info-wrap">
   4:                <ul>
   5:                  <li>System information</li>
   6:                  <li>License holder <%# License.Licensee %></li>
   7:                  <li>License ID <%# License.LicenseID %></li>
   8:                  <li>Sitecore version <%# About.VersionInformation() %></li>
   9:                </ul>
  10:  
  11:                <iframe id="StartPage" runat="server" allowtransparency="true" frameborder="0" scrolling="auto"
  12:                      marginheight="0" marginwidth="0" style="display: none; height: 105px;"></iframe>
  13:  
  14:              </div>
  15:              <div class="login-link-wrap">
  16:                <a href="javascript:;" id="licenseOptionsBack" class="login-link">&lt; Back</a>
  17:              </div>
  18:  
  19:            </div>

If we remove the “display:none” and load the page, some interesting details about the environment are revealed.  On my local machine it appears like this:

[Screenshot: the licenseOptions panel displayed]

That’s not particularly interesting, but if you return to the mark-up and examine the IFrame defined on line #11 named StartPage with the runat=server attribute, you might turn your attention to what’s going on server-side when this default.aspx page loads.

Reflector, or any decompiler, will show that Sitecore.sitecore.login.Default contains an OnInit method with various checks for authentication, databinding, and so on.  I’ll omit most of that method except for the one call that’s of interest here, as we look for a way our local Sitecore system could communicate back to Sitecore the company.

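protected override void OnInit(EventArgs e)
{
    // ... authentication checks, data binding, and so on omitted ...
    this.RenderSdnInfoPage();
    // ... remainder of the method omitted ...
}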

Diving into the RenderSdnInfoPage method, we have this:

private void RenderSdnInfoPage()
{
    this.StartPage.Attributes["src"] = new UrlString(Settings.Login.SitecoreUrl) {
        ["id"] = Sitecore.SecurityModel.License.License.LicenseID,
        ["host"] = WebUtil.GetHostName(),
        ["licensee"] = Sitecore.SecurityModel.License.License.Licensee,
        ["iisname"] = WebUtil.GetIISName(),
        ["st"] = WebUtil.GetCookieValue("sitecore_starttab", string.Empty),
        ["sc_lang"] = Context.Language.Name,
        ["v"] = About.GetVersionNumber(true)
    }.ToString();
    this.StartPage.Attributes["onload"] = "javascript:this.style.display='block'";
}

And now we’re in business!  Sitecore loads up various pieces of data from the local running Sitecore instance and appends it to the end of our URL that we inject into the src attribute of the IFrame.  That src attribute in a default Sitecore installation starts with http://sdn.sitecore.net/startpage.aspx (defined as the Login.SitecoreUrl configuration setting) and the rest of the URL is information communicated to Sitecore about this local instance.

For example, using my Firefox HTML Inspector, my generated IFrame source was as follows (know that I have altered identifying information about my license):

<iframe src="http://sdn.sitecore.net/startpage.aspx?id=20150821010030&host=sc81rev151207&licensee=Not%20Really%20Sharing&iisname=b86449a111164a2cd9c37d771c094dce&st&sc_lang=en&v=8.1.151207" id="StartPage" allowtransparency="true" scrolling="auto" marginheight="0" marginwidth="0" style="display: block; height: 105px;" onload="javascript:this.style.display='block'" frameborder="0"></iframe>
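To make it easier to see exactly what travels in that query string, here is a minimal Python sketch that splits the example URL into its parts. It only inspects the URL locally and makes no request; the function name describe_start_page_url is my own for illustration and is not part of Sitecore.

```python
from urllib.parse import urlsplit, parse_qsl

# Example URL copied from the post (the license details were already altered).
START_PAGE_URL = (
    "http://sdn.sitecore.net/startpage.aspx"
    "?id=20150821010030&host=sc81rev151207&licensee=Not%20Really%20Sharing"
    "&iisname=b86449a111164a2cd9c37d771c094dce&st&sc_lang=en&v=8.1.151207"
)


def describe_start_page_url(url):
    """Return (host, path, params); params maps each query key to its decoded value."""
    parts = urlsplit(url)
    # keep_blank_values=True retains keys that carry no value, such as "st".
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    return parts.netloc, parts.path, params


if __name__ == "__main__":
    host, path, params = describe_start_page_url(START_PAGE_URL)
    print(f"{host}{path}")
    for name, value in params.items():
        print(f"  {name} = {value!r}")
```

The keys line up one-to-one with the indexer assignments in RenderSdnInfoPage: license ID, host name, licensee, IIS name, start tab cookie, language, and version.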

I think it’s a pretty safe bet that startpage.aspx on the sdn.sitecore.net site has some logic running to catalog the querystring parameters and passively monitor what’s going on in the wild.

Faster Sitecore Publishing for Global Implementations

I’d like to link here to my post on the main Rackspace developer blog about Sitecore publishing strategies.

The crux of it is this diagram illustrating how we can make use of SQL Replication to move Sitecore content between data centers, but still take advantage of Sitecore publishing to promote content to the live site.  The blog at Rackspace discusses it in some detail, so I’ll just point out that the idea is to let SQL Server Replication do the hard (and potentially slow) work of synchronizing content around the planet, while Sitecore publishing can be the plain old HTTP publish operation we know and love from Sitecore:

[Diagram: Sitecore enterprise architecture for global publishing]