Sitecore and SearchMaxResults for Solr

I’ve consulted with a number of Sitecore implementations in the last month or two that had a variety of challenges with their Sitecore integration with Solr. Solr is a powerful distributed system, and it touches Sitecore from a number of interesting angles. I wanted to take this chance to comment on a Sitecore setting that can have a significant impact on how Sitecore search functions but is easily overlooked. The setting is defined in Sitecore.ContentSearch.config and it’s called ContentSearch.SearchMaxResults. The XML comment for this setting is straightforward; here’s how it’s presented in that file:

[XML comment from Sitecore.ContentSearch.config not reproduced here]

There’s a lot to digest in that XML comment. One could read it and assume “this can be set but it is best kept as the default” means the value shouldn’t be altered, but in my experience that can be problematic.

The .Net int.MaxValue constant is 2,147,483,647. If you leave this setting at the default (“”), you’re telling Solr to return up to 2,147,483,647 results in a single response to a query, which we’ve observed in some cases to cause significant performance problems (Solr fetches a large volume of records from disk and writes them to the response, causing I/O pressure and so on). It’s not always a problem, since it really comes down to the number of documents the query matches in Solr, but it sets up the potential for a virtually unbounded Solr query.

It’s interesting to trace this setting through Sitecore and into Solr, and it sheds light on how these two complex systems work together. Fun, right!? I cooked up the diagram below that shows an overview of how Sitecore and Solr work together in completing search queries:

[Diagram: overview of how Sitecore and Solr work together to complete search queries]

Each application has its own logging, which helps trace activity between the two systems.

The Sitecore ContentSearch Provider for Solr relies on Solr.Net for connectivity to Solr. It’s common for .Net libraries to mirror their open source equivalents from the Java world (Log4j has a .Net logging port named log4net, Lucene has a .Net search port called Lucene.Net, and so on). Solr.Net, however, is not a port of the Solr Java application to .Net. Instead, Solr.Net is a wrapper around the main Solr API elements so they can be called easily from .Net applications. When it comes to Sitecore’s ContentSearch Provider for Solr, Solr.Net is Sitecore’s bridge for getting data to and from the Solr application.

Just an aside: some projects do creative things with Solr search and Sitecore, and for certain scenarios it’s necessary to bypass Solr.Net and use the REST API directly from Sitecore. This write-up focuses on a conventional Sitecore -> Solr.Net -> Solr pattern, but I wanted to acknowledge that it’s not the only pattern.

Tracking ContentSearch.SearchMaxResults in Sitecore

On the Sitecore side, you can see the ContentSearch.SearchMaxResults setting at work in the Sitecore logs when you turn up the diagnostics to get more granular data. This isn’t a configuration I’d recommend using beyond a discrete troubleshooting session, since the amount of data it can generate is significant . . . but here’s how you dial up the diagnostic data Sitecore reports about queries:

[Configuration snippet for enabling verbose search diagnostics not reproduced here]
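If you’re hunting for the relevant switch, it’s typically the ContentSearch.EnableSearchDebug setting defined in Sitecore.ContentSearch.config (verify the setting name against your version before relying on it). A minimal patch sketch, assuming that setting is present:

<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <settings>
      <!-- assumption: ContentSearch.EnableSearchDebug is the verbose search logging switch in your version -->
      <setting name="ContentSearch.EnableSearchDebug">
        <patch:attribute name="value">true</patch:attribute>
      </setting>
    </settings>
  </sitecore>
</configuration>

Remember to back this out once the troubleshooting session is over, for the log-volume reasons mentioned above.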

If you run a few queries that exercise your Sitecore implementation code that queries Solr, you can review the contents of the Search log in the Sitecore /data directory and find entries such as:

INFO Solr Query – ?q=associated_docs:(“\*A5C71A21-47B5-156E-FBD1-B0E5EFED4D33\*”)&rows=2147483647&fq=_indexname:(domain_index_web)

or

INFO  Solr Query – ?q=((_fullpath:(\/sitecore/content/Branches/ABC/*) AND _templates:(d0351826b1cd4f57ac05816471ba3ebc)))&rows=2147483647&fq=_indexname:(domain_index_web)

The .Net int.MaxValue of 2147483647 is what Sitecore, through Solr.Net, specifies as the number of rows to return for this query. For Solr cores where only a few hundred documents match the query, it’s not that big a deal because the query has a fairly small universe to process and retrieve. If the query matches 100,000 documents, however, that’s a very heavy query for Solr to respond to, and it will probably impact the performance of your Sitecore implementation.

Tracking ContentSearch.SearchMaxResults in Solr

Solr has its own logging system, and this 2147483647 value can be seen in those logs once Solr has serviced the API call. In a default Solr installation, the logs are located at server/logs (check your log4j.properties file if you don’t know for sure where your logs are being stored); open up the log and scan for our ContentSearch.SearchMaxResults setting value. You’ll see entries such as:

INFO  – 2018-03-26 21:20:19.624; org.apache.solr.core.SolrCore; [domain_index_web] webapp=/solr path=/select params={q=(_template:(a6f3979d03df4441904309e4d281c11b)+AND+_path:(1f6ce22fa51943f5b6c20be96502e6a7))&fl=*,score&fq=_indexname:(domain_index_web)&rows=2147483647&version=2.2} hits=2681 status=0 QTime=88

  • The above Solr query returned 2,681 results (hits) and the QTime (time elapsed between the arrival of the query request to Solr and the completion of the request handler) was 88 milliseconds. This is probably no big deal as it relates to the ContentSearch.SearchMaxResults, but you don’t know if this data will increase over time…

INFO  – 2018-03-26 21:20:19.703; org.apache.solr.core.SolrCore; [domain_index_web] webapp=/solr path=/select params={q=((((include_nav_xml_b:(True)+AND+_path:(00ada316e3e4498797916f411bc283cf)))+AND+url_s:[*+TO+*])+AND+(_language:(no-NO)+OR+_language:(en)))&fl=*,score&fq=_indexname:( domain_index_web)&rows=2147483647&version=2.2} hits=9 status=0 QTime=16

  • The above Solr query returned 9 results (hits) and the QTime was 16 milliseconds. This is unlikely a problem when it comes to ContentSearch.SearchMaxResults.

 INFO  – 2018-03-26 21:20:19.812; org.apache.solr.core.SolrCore; [domain_index_web] webapp=/solr path=/select params={q=(_template:(8ed95d51A5f64ae1927622f3209a661f)+AND+regionids_sm:(33ada316e3e4498799916f411bc283cf))&fl=*,score&fq=_indexname:(domain_index_web)&rows=2147483647&version=2.2} hits=89372 status=0 QTime=1600

  • The above Solr query returned 89,372 results (hits) and the QTime was 1600 milliseconds. Look out. This is the type of query that could easily cause problems based on the Sitecore ContentSearch.SearchMaxResults setting as the volume of data Solr is working with is getting large. That query time is already climbing high and that’s a lot of results for Sitecore to require in a single request.

The impact of retrieving so many documents from Solr can cause a cascade of difficulties beyond just the handling of the query. Solr caches query results in memory, so if you request 1,000,000 documents you could also be caching 1,000,000 documents. Too much of this activity can stress Solr to the breaking point.

Best Practices

There is no magic value to set for ContentSearch.SearchMaxResults other than “not the default.” The general best practice when retrieving lots of data from almost any system is to use paging, and that’s recommended for Sitecore ContentSearch queries, too. A general recommendation would be to set a specific value for the ContentSearch.SearchMaxResults setting, such as “500” or “1000”. This should be thoroughly tested, however, as limiting the search results for an implementation that isn’t properly paging its search results can lead to inconsistent behavior across the site. Areas such as site maps, general site search, and any other features whose implementation logic assumes all the search results are available in a single request deserve special attention when tuning this setting.
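A minimal patch sketch for bounding the setting, assuming a value of “500” tests out well for your implementation:

<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <settings>
      <!-- bound the rows Sitecore requests from Solr; tune and test this value for your implementation -->
      <setting name="ContentSearch.SearchMaxResults">
        <patch:attribute name="value">500</patch:attribute>
      </setting>
    </settings>
  </sitecore>
</configuration>

Pair a bounded value like this with paging in the query code itself (Skip and Take on the ContentSearch LINQ queryable, for example) so no feature depends on receiving the entire result set in one request.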

What About Noisy Solr Neighbors?

I’ve worked on some implementations where Solr was a resource shared between a variety of Sitecore implementations. One project, in this example, might set ContentSearch.SearchMaxResults to “2000” for their Sitecore sites while another project sets a value of “500” – but what if there’s a third organization making use of the same Solr resources and that project doesn’t explicitly set a value for ContentSearch.SearchMaxResults? That one project leaves the setting at the “” default, so it uses the .Net int.MaxValue. This is a classic noisy neighbor problem where shared services become a point of vulnerability to all the consuming applications. The one project with “” for ContentSearch.SearchMaxResults could be responsible for dragging Solr performance down across all the sites.

Solr is an extensible platform much like Sitecore, and in some ways even more so. In Sitecore one extends pipelines or overrides events to achieve the customizations you desire; the same general idea can be applied to Solr – you just use Java with Solr instead of C#.

In this case, with unbounded Solr queries being our concern, we can use an extension to a search component (org.apache.solr.handler.component.SearchComponent) to introduce custom logic into Solr’s query processing. Here, we want to enforce a limit on the number of rows a query can request. This acts as a safety net in case an un-tuned Sitecore implementation leaves the ContentSearch.SearchMaxResults setting at the default.

Some care must be taken in how this is introduced into the Solr execution context and where exactly the custom logic is handled. This topic is covered very well, along with sample code, at https://jorgelbg.wordpress.com/2015/03/12/adding-some-safeguard-measures-to-solr-with-a-searchcomponent/. For an enterprise Solr implementation with a variety of Sitecore consumers, a safety measure such as this can be vital to the general stability and performance of the platform, especially if one doesn’t have control over the specific Sitecore projects and their use (or abuse!) of the ContentSearch.SearchMaxResults setting. File this under best practices for governing Sitecore implementations with shared Solr infrastructure.
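To give a sense of where such a safeguard plugs in, here’s a rough sketch of how a custom SearchComponent is typically wired into solrconfig.xml; the component name and Java class below are hypothetical placeholders, and the linked post covers the actual implementation:

<!-- register the custom component (class name is a placeholder) -->
<searchComponent name="rowLimiter" class="com.example.solr.RowLimitingSearchComponent" />

<!-- run it ahead of the standard components, within your existing /select handler definition -->
<requestHandler name="/select" class="solr.SearchHandler">
  <arr name="first-components">
    <str>rowLimiter</str>
  </arr>
</requestHandler>

The component would inspect the incoming rows parameter and clamp it to a sane ceiling before the query component executes.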

Azure Search compared to Solr for Sitecore PaaS (Chapter 1: Ingestion)

I’ve been investigating Azure PaaS architectures for Sitecore lately, and I wanted to take a few minutes and summarize some recent findings around the standard Sitecore search providers of Solr and, new for Sitecore PaaS, Azure Search.

To provision Azure PaaS Sitecore environments, I used a variant of the ARM Template approach outlined in this blog.  For simplicity, I evaluated a basic “XP-0” which is the name for the Sitecore CM/CD server combined into a single App Service.  This is considered a basic setup for development or testing, but not real production . . . that’s OK for my purposes, however, as I’m interested in comparing the Sitecore search providers to get an idea for relative performance.

The Results

I’ll save the methodology and details for lower in this post, since I’m sure most readers don’t want to wait for the punchline.  The Solr search provider performed faster, no matter the App Service or DB tier I evaluated in Azure PaaS:

[Chart: average full re-index duration in minutes, Solr provider vs. Azure Search provider, by App Service/DB tier]

The chart shows the average time, in minutes, to perform the full re-index operation.  You may want to refer to my earlier post about the lack of HA with Sitecore’s use of Azure Search; rest assured Sitecore is addressing this in a product update soon, but for now it casts a more significant shadow over the 60+ minutes one could spend waiting for the search re-index to complete.

Methodology

In these PaaS trials, I set up the sample site LaunchSitecore.  I performed rebuilds of the sitecore_core_index through the Sitecore Control Panel as my benchmark; I like using this operation as a benchmark since it involves over 80,000 documents.  It doesn’t particularly exercise the querying aspects of Sitecore search, though, so I’ll save that dimension for another time.  I’ve got time set aside for JMeter testing that will shed light on this later…

To get the duration the system took to complete the re-index, I queried the PaaS Sitecore logs as described in this Sitecore KB article.  Using results like the following, I worked from the timestamps, since I’ve found the Sitecore UI to be unreliable in reporting duration for index rebuilds.

[Screenshot: App Insights query results showing crawler start and end log entries]

You can get at this data yourself in App Insights with a query such as this:

traces
| where timestamp > now(-3h)
| where message contains " Crawler [sitecore_core_index]" 
| project timestamp, message
| sort by timestamp desc

Remember, I’ve used the XP0 PaaS ARM Templates which combine CM and CD roles together, so there’s no need for the “where cloud_RoleInstance == ‘CloudRoleBlahBlah'” in the App Insights query.
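For a scaled topology where CM and CD run as separate App Services, a sketch of the same query with that role filter added (the role instance name is the placeholder quoted above; substitute your own):

traces
| where timestamp > now(-3h)
| where message contains " Crawler [sitecore_core_index]"
| where cloud_RoleInstance == "CloudRoleBlahBlah"
| project timestamp, message
| sort by timestamp desc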

Methodology – Azure Search

For my Azure Search testing, I experimented with scaling options for Azure Search.  For speedier document ingestion, the guidance from Microsoft says:

“Partitions allow for scaling of document counts as well as faster data ingestion by spanning your index over multiple Azure Search Units”

The trials should perform more quickly with additional Azure Search partitions, but I found changing this made zero difference.  My instinct is that the fact Sitecore isn’t using Azure Search Indexers could be a reason scaling Azure Search doesn’t improve performance in my trials.  Sitecore is making REST calls to index documents with Azure Search, which is fine, but possibly not the best fit for high-volume operations.  I haven’t looked in the DLLs, but perhaps there are other async models one could use in the Azure Search provider when it comes to full re-indexes?  It could also be that the 80,000 documents in the sitecore_core_index are too small a corpus to take advantage of Azure Search’s scaling options.  This will be an area for additional research in the future.

Methodology – Solr

To host Solr for this trial, I used a basic Solr VM in the Rackspace cloud.  One benefit of working at Rackspace is easy access to these sorts of resources 🙂  I picked a 4 GB server running Solr 5.5.1.  I used one Solr core per Sitecore index (a 1:1 mapping); see my write-up on Solr core organization if you’re not following why this might be relevant.

For my testing with the Solr search provider, Sitecore running in Azure PaaS needed to connect outside Azure, so I selected a location near Azure US-East, where my App Service was hosted.  I had some concerns about outbound data charges, since data leaving Azure triggers egress bandwidth fees (see this schedule for pricing).  For the few weeks while I collected this data, the outbound data fee totaled less than $40 — and that includes other people using the same Azure account for other experiments.  I estimate around 10% (just $4) is due to my experiments.  Suffice it to say, using a Solr environment outside of Azure isn’t a big expense to worry about.  Just the same, running Solr in an Azure VM would certainly be the recommendation for any real Sitecore implementation following this pattern.  For these tests, I chose the Rackspace VM since I already had it handy.

I’d be remiss to not mention the excellent work Sitecore’s Ivan Sharamok has posted to help make Solr truly enterprise ready with Sitecore.  Basic Auth for Solr with Sitecore is important for the architecture I exercised; this post is another gem of Ivan’s worth including here, even if I didn’t make use of it in this specific set of evaluations.  Full disclosure: I worked with Ivan while I was at the Sitecore mothership, so I’m biased that his contributions are valuable, but just because I’m biased doesn’t mean I’m wrong.

Conclusions

I’ll include my chart once again:

[Chart: average full re-index duration in minutes, Solr provider vs. Azure Search provider, by App Service/DB tier]

These findings lead me to more questions than answers, so I’m hesitant to make any sweeping generalizations here.  I’m safe declaring Sitecore’s search provider for Solr to be faster than the Azure Search alternative when it comes to full index rebuilds; that’s clear, by an order of magnitude in some cases.  Know that this is not a judgement about Solr versus Azure Search; this is about the way Sitecore makes use of these two search technologies out of the box.  The Solr provider for Sitecore is battle-tested and has gone through many years of development; I think the Azure Search provider for Sitecore could be considered a beta at this point, so it’s important not to get ahead of ourselves.

A couple other conclusions could be:

  1. Whether using Solr or Azure Search, there is no improvement to search re-index performance when moving from the S3 to the P3 tier in Azure App Services.
  2. Changing from the S1 to S3 tiers, on the other hand, makes a big perf difference in terms of search re-indexing.
    • Honestly, the S1 tier is almost unusable as the single CPU core and 1.75 GB RAM are way too low for Sitecore; the S3 with 4 cores and 7 GB RAM is much more reasonable to work with.

Next Up

It’s time for me to consider the more fully scaled PaaS options with Sitecore, and I need to exercise the query side of the Sitecore search provider instead of just the indexing side.

Solr Configuration for Integration with Sitecore

I’ve got a few good Solr and Sitecore blogs around 75% finished, but I’ve been too busy lately to focus on finishing them.  In the meantime, I figure a picture can sometimes be worth 1,000 words, so let me post this visual representation of Solr strategies for Sitecore integrations.  One Solr core per index is certainly the best practice for production Sitecore implementations, but now that Solr support has significantly matured at Sitecore, a single Solr core for all the Sitecore indexes is a viable, if limited, option:

[Diagram: Solr core strategies for Sitecore, one core per index vs. a single shared core]

There used to be a bug (or two?) that made this single Solr core for every Sitecore index unstable, but that’s been corrected for some time now.
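For reference, which Solr core backs a given Sitecore index is controlled by the core parameter in each index definition. A rough sketch of the relevant piece (the exact structure of the surrounding config varies by Sitecore version, and the shared-core idea noted in the comment is just an illustration):

<index id="sitecore_web_index" type="Sitecore.ContentSearch.SolrProvider.SolrSearchIndex, Sitecore.ContentSearch.SolrProvider">
  <param desc="name">$(id)</param>
  <!-- $(id) yields a dedicated core named sitecore_web_index (one core per index);
       pointing every index at the same core name instead gives the single shared core approach -->
  <param desc="core">$(id)</param>
  <!-- remaining params (property store, configuration, strategies, locations) omitted -->
</index>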

More to follow!

Sitecore Gets Serious About Search, by Leaving the Game

Sitecore is steadily reducing the use-cases for Lucene as the search provider for the Sitecore product.  Sitecore published this document on “Solr or Lucene” a few days ago, going so far as to state this one criterion where you must use Solr:

  • You have two or more content delivery servers

Read that bullet point again.  If you’ve worked with Sitecore for a while, this should trigger an alarm.

Sound the alarm: most implementations have 2 or more CD servers!  I’d say 80% or more of Sitecore implementations are using more than a single CD server, in fact.  Carrying this logic forward, 80% of Sitecore implementations should not be running on Lucene!  This is a big departure for Sitecore as a company, which would historically dodge conversations about the right technology for a given situation.  I think the Sitecore philosophy was that advocating for one technical approach over another is risky, and a Sitecore endorsement could make them accountable if a technology approach fails for one reason or another.  For as long as I’ve known Sitecore, it’s been a software company intent on selling licenses, not dictating how to use the product.  Risk aversion has kept them from getting really down in the weeds with customers.  This tide has been turning, however, with things like Helix and now this more aggressive messaging about the limitations of Lucene.  I think it’s great for Sitecore to be more vocal about best practices; it’s just taken years for them to come around to the idea.

As a bit of a search geek, I want to state for the record that this new Solr-over-Lucene guidance from Sitecore is not really an indictment of Lucene.  The Apache Lucene project, and its cousin the .Net port Lucene.Net that Sitecore makes use of out of the box, was groundbreaking in many ways.  As a technology, Lucene can handle enormous volumes of documents and performs really well.  Solr is making use of Lucene underneath it all, anyway!  This recent announcement from Sitecore is more an acknowledgement that Sitecore’s event plumbing is no substitute for Solr’s CAP-theorem straddling acrobatics.  Sitecore is done trying to roll their own distributed search system.  I think this is Sitecore announcing that they’re tired of patching the EventQueue, troubleshooting search index update strategies for enterprise installations, and giving up on ensuring clients don’t hijack the Sitecore heartbeat thread and block search indexing with some custom boondoggle in the publishing pipeline.  They’re saying: we give up — just use Solr.

Amen.  I’m fine with that.

To be honest, I think this change of heart can also be explained by the predominant role Azure Search plays in the newest Sitecore PaaS offering.  Having an IP address for all the “search stuff” is awfully nice, whether you’re running in Azure or anywhere else.  It’s clear Sitecore decided a few years ago that they weren’t keen to recreate the search wheel, and they’re steadily converging around technologies with either huge corporate backing (Azure Search) or a vibrant open source community (Solr).

I should also note, here, that Coveo for Sitecore probably welcomes this opportunity for Sitecore teams to break away from the shackles of the local Lucene index.  I’m not convinced that, long-term, Coveo can out-run the likes of Azure Search or Solr, but I know that today, if your focus is quickly ramping up a search-heavy site on Sitecore, you should certainly weigh the pros and cons of Coveo.

With all this said, I want to take my next few posts and dig really deep into Solr for Sitecore and talk about performance monitoring, tuning, and lots of areas that get overlooked when it comes to search with Sitecore and Solr.   So I’ll have more to say on this topic in the next few days!

 

The Sitecore Pie: strategic slicing for better implementations

The Sitecore Pie

At some point I want to catalog the full set of features a Sitecore Content Delivery and Content Management server can manage in a project.  My goal would be to identify all the elements that can be split apart into independent services.  This post is not a comprehensive list of those features, but serves to introduce the concept.

Think of Sitecore as a big blueberry pie that can be sliced into constituent parts.  Some Sitecore sites can really benefit from slicing the pie into small pieces and letting dedicated servers or services manage a specific aspect of the pie.  Too often, companies don’t strategize around how many different kinds of work their Sitecore solution is doing.

An example will help communicate my point: consider IIS and how it serves as the execution context for Sitecore.  Many implementations run logic for all of the following through the same IIS server that is handling the Sitecore request for rendering a web page.  These are all slices of the Sitecore pie for a Sitecore Content Delivery server:

  1. URL redirection through Sitecore
  2. Securing HTTP traffic with SSL
  3. Image resizing for requests using low-bandwidth or alternative devices
  4. Serving static assets like CSS, JS, graphics, etc
  5. Search indexing and query processing (if one is using Lucene)

If you wanted to cast a broader net, you could include HTTP Session state for when InProc mode is chosen, Geo-IP look-ups for certain CD servers, and others to this list of pie slices.  Remember, I didn’t claim this was an exhaustive list.  The point is: IIS is enlisted in all this other work besides processing content into HTML output for Sitecore website visitors.

Given our specific pie slices above, one could employ the following alternatives to relieve IIS of the processing:

  1. URL Redirection at the load balancer level can be more performant than having Sitecore process redirects
  2. Apply SSL between the load balancer and the public internet, but not between the IIS nodes behind your load balancer — called “SSL Offloading” or “SSL Termination”
  3. There are services like Akamai that fold in dynamic image processing as part of their suite of products
  4. Serving static assets from a CDN is common practice for Sitecore
  5. Coveo for Sitecore is an alternative search provider that can take a lot of the customer-facing search aspects and shift them to dedicated search servers or even Coveo’s Cloud.  One can go even further with Solr for Sitecore, or still other search tiers if you’re really adventurous

My point is, just like how we hear a lot this election season about “let Candidate X be Candidate X” — we can let Sitecore be Sitecore and allow it to focus on rendering content created and edited by content authors and presenting it as HTML responses.  That’s what Sitecore is extremely valuable for.

Enter the Cache

I’ve engaged with a lot of Sitecore implementations that were examining their Sitecore pie and determining what slices belong where . . . and frequently we’d observe that the caching layer of Sitecore was tightly coupled with the rest of the Sitecore system, so caching wasn’t a good candidate for slicing off.  There wasn’t a slick provider model for Sitecore caches, and while certain caches could be partially moved to other servers, it wasn’t clean, complete, or convenient.

That all changed officially with the initial release of Sitecore 8.2 last month.  Now there is a Sitecore.Caching.DefaultCacheManager class, a Sitecore.Caching.ICache interface, and other key extension points as part of the standard Sitecore API.  One can genuinely add the Sitecore cache to the list of pie slices one can consider for off-loading.

In my next post, I will explore using this new API to use Redis as the cache provider for Sitecore instead of the standard in-memory IIS cache.

Sitecore RemoteRebuild Strategy Best Practices

I spent entirely too long troubleshooting a customer’s approach to the RemoteRebuild indexing strategy for Sitecore.  The official documentation is fairly straightforward, but there are some significant gaps left for the reader to infer or figure out.

I think the heading “Best Practice” on that documentation page is great, and I hope Sitecore continues to expand those notes to be as thorough as possible.  That being said, I would include the following example patch configuration showing how to apply the change without manually editing Sitecore base install files:

<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <contentSearch>
      <configuration type="Sitecore.ContentSearch.ContentSearchConfiguration, Sitecore.ContentSearch">
        <indexes hint="list:AddIndex">
           <index id="sitecore_web_index" type="Sitecore.ContentSearch.LuceneProvider.LuceneIndex, Sitecore.ContentSearch.LuceneProvider">
            <strategies hint="list:AddStrategy">
              <strategy ref="contentSearch/indexConfigurations/indexUpdateStrategies/remoteRebuild" />
            </strategies>
          </index>
        </indexes>
      </configuration>
    </contentSearch>
  </sitecore>
</configuration>

This patch should be applied on the Sitecore CD servers where you want to perform the Lucene re-indexing operations.  There are no configuration changes necessary to the Sitecore CM servers.

Speaking of the CM server, one might think the posted Sitecore prerequisites cover it:

  • The name of the index on the remote server must be identical to the name of the index that you forced to rebuild.
  • You must enable the EventQueue.
  • The database you assign for system event queue storage (core by default) must be shared between the Sitecore instance where the rebuild takes place and the other instances.
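As an aside on the EventQueue prerequisite above, a minimal patch sketch for enabling it (the setting ships in the base Sitecore config, so this just flips the value; verify the default against your version):

<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <settings>
      <!-- the EventQueue must be enabled on the instances participating in the remote rebuild -->
      <setting name="EnableEventQueues">
        <patch:attribute name="value">true</patch:attribute>
      </setting>
    </settings>
  </sitecore>
</configuration>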

But the biggest addition to this I found was that the Indexing Manager feature in the Control Panel does not call the proper API to trigger the RemoteRebuild activity, so this screen is not where one initiates the remote rebuild:

Won’t work:

[Screenshot: the Indexing Manager in the Sitecore Control Panel]

The only way to properly activate the RemoteRebuild is via the Developer ribbon in the Sitecore Content Editor.

This works:

[Screenshot: the rebuild option on the Developer ribbon in the Content Editor]

See this quick video on how to enable this Developer area in the Content Editor, in case this is new to you.

Apparently this dependence on the Developer ribbon is a bug in Sitecore and scheduled to be corrected in a future release.  I know I spent several hours troubleshooting configuration changes and attempted many permutations of the configuration for the RemoteRebuild strategy before Sitecore support confirmed this fact (as did the ever-helpful Ivan Sharamok).

The only other detail I’ll share on the Sitecore RemoteRebuild strategy is that one should verify whether the Lucene index being wired up for RemoteRebuild uses Sitecore.ContentSearch.LuceneProvider.SwitchOnRebuildLuceneIndex or just Sitecore.ContentSearch.LuceneProvider.LuceneIndex.  The patch .config file one uses should reference the matching type attribute.
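For example, if the index in question is a switch-on-rebuild index, the earlier patch would reference that type instead; a sketch of just the index element:

<index id="sitecore_web_index" type="Sitecore.ContentSearch.LuceneProvider.SwitchOnRebuildLuceneIndex, Sitecore.ContentSearch.LuceneProvider">
  <strategies hint="list:AddStrategy">
    <strategy ref="contentSearch/indexConfigurations/indexUpdateStrategies/remoteRebuild" />
  </strategies>
</index>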

A final note I’ll add about the rationale for RemoteRebuild . . . this customer has several Sitecore Content Delivery servers spread across different data centers, and is currently reliant on the Lucene search provider for Sitecore.  One could make a strong case the customer should be using Solr instead, and indeed this is on their implementation road map, but in the meantime we suggested RemoteRebuild as a way for them to centrally manage the state of indexes on all their disparate CD servers.  The alternative would be depending on the onPublishEndAsync strategy (which works well, but has limited applications in certain cases), or making an administrative connection to each CD server (via browser or PowerShell, etc.) and running something along the lines of the following when they need to rebuild specific indexes:

// requires: using Sitecore.ContentSearch;
public void Page_Load(object sender, EventArgs e)
{
    Server.ScriptTimeout = 3600; // one hour, in case the rebuild takes a while
    ContentSearchManager.GetIndex("your_index_name").Rebuild();
}

This quickly becomes unwieldy and hard to maintain, however, so getting RemoteRebuild wired up in this case is going to be a valuable addition to the implementation.

Search User Group Talk

I did a talk for the New England Search Technology (NEST) user group yesterday.  Even though the meetings in Boston are a good 90+ minutes away for me, I try to make the trip a couple times a year since the topics are usually very relevant to what I’m up to with Sitecore.  I offered to do a talk bridging the Sitecore and Search domains, and they took me up on it.  The audience is typically serious Solr and Elasticsearch technologists, some of them regular committers to those projects, so it was fun to bring those domains together with Sitecore’s relative immaturity when it comes to search as a platform.

I don’t want to just post the PowerPoint presentations and say “have at it,” since the presentations require context to make sense (and there are two different PowerPoints to sift through).  I’m a non-conventional guy, and I try to avoid PowerPoint with bullet points galore and hyper-structured material.  The talk was more a conversation about search and the challenges unique to search with Sitecore (and other CMS platforms that build on Lucene).

My premise relied heavily on Plato’s Allegory of the Cave where I was a “prisoner” experiencing the world of search through the “cave” of Sitecore.  In reality, search is an enormous space with lots of complexity and innovation across the technology . . . but in terms of Sitecore, we experience a filtered reality (the shadows on the cave wall).  This graphic represents a traditional Allegory of the Cave concept:

[Image: traditional depiction of Plato’s Allegory of the Cave]

I’m not going to summarize the entire talk, but I want to state for the record this isn’t a particular criticism of Sitecore — it’s just the nature of working with a product that is built on top of other products.  In Sitecore’s case with Lucene, for example, there’s a .Net port of Lucene (Lucene.Net) and a custom .Net event pipeline in Sitecore used to orchestrate Lucene.Net activities via an execution context of IIS, and so on.

Understanding Search technology with Sitecore is always going to be filtered through the lens of Sitecore.  My talk was addressing that fact, and how Search (with a capital “S”) is a far broader technology than what is put to use in Sitecore.  Understanding the world of Search beyond the confines of Sitecore can be very revealing, and opens up a lot of opportunity for improving a Sitecore implementation.

I should also note that my talk benefited from insightful input from Sitecore-Coveo Jeff, Tim “Sitecore MVP” Braga, Al Cole from Lucidworks, and others.  The host, Monster.com, shared a great meeting space for us and opened the evening with a quick survey of their specific experience with search using Solr, Elasticsearch, and their own proprietary technology.

While not specific to Sitecore, one link I wanted to share in particular was the talk about Bloomberg’s 3-year journey with Solr/Lucene; it’s a talk from the Berlin Buzzwords conference a couple weeks ago and thoroughly worth watching.  Getting search right can take persistence and smart analysis, regardless of the platform.  With Sitecore, implementations too often assume search will just work out of the box and don’t appreciate that it’s a critical set of features worthy of careful consideration.

I’ll have a few follow-up posts covering more of the points I made in my talk; some lend themselves to distinct blog posts instead of turning this into a sprawling re-hash of the entire evening.

Adding Custom Fields to Sitecore Search Indexes in Solr vs Lucene

We’re doing a lot with Solr lately, and I wanted to make a note of the difference in how one defines custom fields to include in your Sitecore indexes. I’m not including how one defines a full custom index, but just the field definition; if one has a “Description” and “Conversation” field to include in a search index for Sitecore, this summarizes the basics of how to make that happen.

Lucene

One adds <field> nodes to the fieldNames section of the fieldMap in the XML configuration.  Note that this example alters the defaultLuceneIndexConfiguration for Sitecore, which isn’t a best practice; it’s generally better to define your own index that is laser-focused on just the content you want to use for your implementation, but I wanted as succinct an example as possible, hence the caveat!

 <sitecore>  
   <contentSearch>  
    <indexConfigurations>  
      <defaultLuceneIndexConfiguration type="Sitecore.ContentSearch.LuceneProvider.LuceneIndexConfiguration, Sitecore.ContentSearch.LuceneProvider">  
       <fieldMap type="Sitecore.ContentSearch.FieldMap, Sitecore.ContentSearch">  
         <fieldNames hint="raw:AddFieldByFieldName">  
          <field fieldName="Description" storageType="YES" indexType="TOKENIZED" vectorType="NO" boost="1f" type="System.String" settingType="Sitecore.ContentSearch.LuceneProvider.LuceneSearchFieldConfiguration, Sitecore.ContentSearch.LuceneProvider">  
            <analyzer type="Sitecore.ContentSearch.LuceneProvider.Analyzers.LowerCaseKeywordAnalyzer, Sitecore.ContentSearch.LuceneProvider" />  
          </field>  
          <field fieldName="Conversation" storageType="YES" indexType="TOKENIZED" vectorType="NO" boost="1f" type="System.String" settingType="Sitecore.ContentSearch.LuceneProvider.LuceneSearchFieldConfiguration, Sitecore.ContentSearch.LuceneProvider">  
            <analyzer type="Sitecore.ContentSearch.LuceneProvider.Analyzers.LowerCaseKeywordAnalyzer, Sitecore.ContentSearch.LuceneProvider" />  
          </field>  
         </fieldNames>  
       </fieldMap>  
      </defaultLuceneIndexConfiguration>  
    </indexConfigurations>  
   </contentSearch>  
 </sitecore>  

Solr

One adds <fieldReader> nodes to the mapFieldByTypeName section of the FieldReaders in the XML configuration.  The same caveat applies here for Solr as with Lucene — it’s generally recommended to define your own Solr index and work from that, but I wanted a minimal example:

 <sitecore>  
  <contentSearch>    
   <indexConfigurations>  
    <defaultSolrIndexConfiguration type="Sitecore.ContentSearch.SolrProvider.SolrIndexConfiguration, Sitecore.ContentSearch.SolrProvider">  
     <FieldReaders type="Sitecore.ContentSearch.FieldReaders.FieldReaderMap, Sitecore.ContentSearch">  
      <mapFieldByTypeName hint="raw:AddFieldReaderByFieldTypeName">  
       <fieldReader fieldTypeName="Description"  fieldReaderType="Sitecore.ContentSearch.FieldReaders.RichTextFieldReader, Sitecore.ContentSearch" />  
       <fieldReader fieldTypeName="Conversation"  fieldReaderType="Sitecore.ContentSearch.FieldReaders.RichTextFieldReader, Sitecore.ContentSearch" />  
      </mapFieldByTypeName>  
     </FieldReaders>  
    </defaultSolrIndexConfiguration>  
   </indexConfigurations>  
  </contentSearch>  
 </sitecore>  

 

This is really just the tip of the iceberg, as fieldReaderType definitions for Solr and Lucene analyzers — just two examples — open up so many possibilities.  I struggled to find this side-by-side info for Lucene and Solr as it applies to Sitecore (Sitecore 8.1 update-1, specifically), so I wanted to share it here.
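To round this out, here’s a hedged sketch of how those custom fields might be consumed from code once they’re indexed; the ConversationResultItem class, the index name, and the lowercased field names are illustrative assumptions rather than anything from the configs above:

using System.Collections.Generic;
using System.Linq;
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.SearchTypes;

// Hypothetical result type mapping the two custom fields to strongly typed properties;
// the lowercase index field names are assumptions, so adjust to how the fields land in your index.
public class ConversationResultItem : SearchResultItem
{
    [IndexField("description")]
    public string Description { get; set; }

    [IndexField("conversation")]
    public string Conversation { get; set; }
}

public static class ConversationSearch
{
    public static List<ConversationResultItem> FindByDescription(string term)
    {
        // sitecore_master_index is used only as an example index name
        using (var context = ContentSearchManager.GetIndex("sitecore_master_index").CreateSearchContext())
        {
            return context.GetQueryable<ConversationResultItem>()
                .Where(item => item.Description.Contains(term))
                .Take(20)
                .ToList();
        }
    }
}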

Sitecore Config for Publishing to Remote Data Centers

A couple months back, I shared our Rackspace publishing strategy for Sitecore in multiple data centers.  It’s fast, reliable, and we’ve had good experiences taking that approach.

This diagram shows what I’m talking about:

[Diagram: 2016-02-23 Sitecore enterprise architecture for global publishing]

That post at developer.Rackspace.com is only part of the picture, however, and I wanted to return to this topic since I recently had a chance to revisit our Sitecore configuration for a customer running Sitecore 7.2.  Referring to the diagram above, we needed to patch the client Sitecore configuration to accommodate the addition of the proxyWebForHongKong and proxyWebForDallas databases; we were finishing up the provisioning of a clean environment, and had replication running, but we needed to connect the two remote data centers so they fully participated with publishing.

Wading through the Sitecore documentation to find these key architectural pieces is tough, and you would need to synthesize at least the Search Scaling Guide, the Scaling Guide, and the Search and Indexing Guide to get to all this material.  For my own benefit, probably more than for any potential reader, I’m consolidating this here.

It’s important that when one does a publish in Sitecore, it triggers index updates on the remote servers as well as moves the content to the appropriate Web database.  This is the crux of what I’m up to here.

At a high level, we had to complete these steps:

  1. Add the database connection strings for the two proxy databases serving our two remote data centers
  2. Add the two proxy database definitions under the <sitecore> node so they can be referenced as Sitecore Publishing Targets etc
  3. Add an index update strategy specific for each proxy database
    • This only needs to be on the environments where the index is used; for example, sitecore_web_index configurations would only be relevant on the CD servers
  4. Connect these new indexing strategies to indexes that need to be kept consistent between data centers
    • This customer is making some use of Lucene, so there is a sitecore_web_index that’s relevant here
    • This customer is using an alternative search provider, too, but I’m focusing on the out-of-the-box configuration as it relates to Lucene here

To implement all of the above, we would typically steer customers to a Sitecore patch configuration strategy such as this one by Bryan “Flash” Furlong.  For simplicity, however, we at Rackspace created a subdirectory App_Config/Include/Rackspace and placed a single config patch file in it.  We don’t own the solution structure, and work to isolate our changes as much as possible from a customer implementation.  So having a single file we can reference is handy for our use case — but not what I’d consider an absolute best practice.

Here’s a nice summary on using subdirectories for Sitecore configurations, in case that’s new to you; here’s Sitecore’s official configuration patch documentation if you’d like to review that: https://sdn.sitecore.net/upload/sitecore6/60/include_file_patching_facilities_sc6orlater-usletter.pdf.

Unfortunately, the <connectionStrings> definitions are not under the Sitecore node and so we can’t patch those the same way, but everything else is in the Sitecore configuration patch below.  I’m not going to detail the following XML, but summarize by saying steps 2 through 4 above are handled here.  Other relevant settings, such as EnableEventQueues etc, are set elsewhere as the Sitecore scaling guide has been broadly applied to this setup already . . . the following is just the changes we needed to make in this particular case, but I’m hoping the example is helpful:

<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <!-- 
  We cannot patch in a connection string, so be sure to explicitly add connectionstrings
  pointing to the appropriate proxy databases
  <add name="ProxyWebForHongKong" connectionString="Trusted_Connection=True;Data Source=SQLCluster;Database=ProxyWebForHongKong" />
  <add name="ProxyWebForDallas" connectionString="Trusted_Connection=True;Data Source=SQLCluster;Database=ProxyWebForDallas" />

-->
  <sitecore>

    <!-- Add the database configurations that match the connectionstring entries
    (The id of the <database> node must be equal to the name of your connection string node)-->
    <databases>
      <database id="ProxyWebForHongKong" singleInstance="true" type="Sitecore.Data.Database, Sitecore.Kernel">
        <param desc="name">$(id)</param>
        <icon>Network/16x16/earth.png</icon>
        <securityEnabled>true</securityEnabled>
        <dataProviders hint="list:AddDataProvider">
          <dataProvider ref="dataProviders/main" param1="$(id)">
            <disableGroup>publishing</disableGroup>
            <prefetch hint="raw:AddPrefetch">
              <sc.include file="/App_Config/Prefetch/Common.config" />
              <sc.include file="/App_Config/Prefetch/Webdb.config" />
            </prefetch>
          </dataProvider>
        </dataProviders>
        <proxiesEnabled>false</proxiesEnabled>
        <proxyDataProvider ref="proxyDataProviders/main" param1="$(id)" />
        <archives hint="raw:AddArchive">
          <archive name="archive" />
          <archive name="recyclebin" />
        </archives>
        <Engines.HistoryEngine.Storage>
          <obj type="Sitecore.Data.SqlServer.SqlServerHistoryStorage, Sitecore.Kernel">
            <param connectionStringName="$(id)" />
            <EntryLifeTime>30.00:00:00</EntryLifeTime>
          </obj>
        </Engines.HistoryEngine.Storage>
        <Engines.HistoryEngine.SaveDotNetCallStack>false</Engines.HistoryEngine.SaveDotNetCallStack>
        <cacheSizes hint="setting">
          <data>20MB</data>
          <items>10MB</items>
          <paths>500KB</paths>
          <itempaths>10MB</itempaths>
          <standardValues>500KB</standardValues>
        </cacheSizes>
      </database>
      <database id="ProxyWebForDallas" singleInstance="true" type="Sitecore.Data.Database, Sitecore.Kernel">
        <param desc="name">$(id)</param>
        <icon>Network/16x16/earth.png</icon>
        <securityEnabled>true</securityEnabled>
        <dataProviders hint="list:AddDataProvider">
          <dataProvider ref="dataProviders/main" param1="$(id)">
            <disableGroup>publishing</disableGroup>
            <prefetch hint="raw:AddPrefetch">
              <sc.include file="/App_Config/Prefetch/Common.config" />
              <sc.include file="/App_Config/Prefetch/Webdb.config" />
            </prefetch>
          </dataProvider>
        </dataProviders>
        <proxiesEnabled>false</proxiesEnabled>
        <proxyDataProvider ref="proxyDataProviders/main" param1="$(id)" />
        <archives hint="raw:AddArchive">
          <archive name="archive" />
          <archive name="recyclebin" />
        </archives>
        <Engines.HistoryEngine.Storage>
          <obj type="Sitecore.Data.SqlServer.SqlServerHistoryStorage, Sitecore.Kernel">
            <param connectionStringName="$(id)" />
            <EntryLifeTime>30.00:00:00</EntryLifeTime>
          </obj>
        </Engines.HistoryEngine.Storage>
        <Engines.HistoryEngine.SaveDotNetCallStack>false</Engines.HistoryEngine.SaveDotNetCallStack>
        <cacheSizes hint="setting">
          <data>20MB</data>
          <items>10MB</items>
          <paths>500KB</paths>
          <itempaths>10MB</itempaths>
          <standardValues>500KB</standardValues>
        </cacheSizes>
      </database>
    </databases>


    <!-- Add the specific update strategies for each data center-->
    <contentSearch>
      <indexUpdateStrategies>
        <onPublishEndAsyncProxyHongKong type="Sitecore.ContentSearch.Maintenance.Strategies.OnPublishEndAsynchronousStrategy, Sitecore.ContentSearch">
          <param desc="database">ProxyWebForHongKong</param>
          <CheckForThreshold>true</CheckForThreshold>
        </onPublishEndAsyncProxyHongKong>
        <onPublishEndAsyncProxyDallas type="Sitecore.ContentSearch.Maintenance.Strategies.OnPublishEndAsynchronousStrategy, Sitecore.ContentSearch">
          <param desc="database">ProxyWebForDallas</param>
          <CheckForThreshold>true</CheckForThreshold>
        </onPublishEndAsyncProxyDallas>
      </indexUpdateStrategies>




      <!-- Add the additional indexing strategies to the sitecore_web_index, so our additional regions get the messages to update Lucene-->
      <configuration type="Sitecore.ContentSearch.ContentSearchConfiguration, Sitecore.ContentSearch">
        <indexes hint="list:AddIndex">
          <index id="sitecore_web_index" type="Sitecore.ContentSearch.LuceneProvider.LuceneIndex, Sitecore.ContentSearch.LuceneProvider">
            <strategies hint="list:AddStrategy">
              <strategy ref="contentSearch/indexUpdateStrategies/onPublishEndAsyncProxyHongKong" />
              <strategy ref="contentSearch/indexUpdateStrategies/onPublishEndAsyncProxyDallas" />
            </strategies>
          </index>
        </indexes>
      </configuration>
      
      
    </contentSearch>
  </sitecore>
</configuration>

Snowball Analyzer for Sitecore

During an assessment of a customer’s Sitecore implementation, I re-acquainted myself with an old familiar area in the Sitecore.ContentSearch.LuceneProvider assembly. For some previous projects, now a year ago or longer, I spent *days* digging through the ContentSearch library working on some very specific tuning scenarios. This past week, I had a chance to revisit my notes from those efforts and formulate a recommendation for a customer looking to improve their Sitecore search performance.

In this particular case, a customer was frustrated by the Content Management side of Sitecore and some of the particularities around search.  Besides sharing the Sitecore Content Author’s Cookbook and encouraging them to thoroughly review the pieces about special characters in the search query, another aspect we assisted with was their evaluation of alternative Analyzers for search with Sitecore’s Lucene provider.

SnowballAnalyzer (Lucene.Net.Analysis.Snowball.SnowballAnalyzer)

The SnowballAnalyzer ships with Sitecore installations as part of the Lucene.Net.Contrib.Snowball assembly.  With it, if one searches for “apples” and the indexed text contains the value “apple” — but not the plural — the “apple” result is still returned.  Internally, apples and apple are both reduced by the SnowballAnalyzer to appl.  Furthermore, if text contains the word “weakness” and the “weak” or “weakly” search query is executed, it will match the item.  Try it out for yourself at http://snowballstem.org/demo.html — try out raining, rained, and rain and see how they all reduce to the same stem courtesy of the SnowballAnalyzer logic.
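A quick way to see the stemming behavior outside of Sitecore is to run the analyzer directly with Lucene.Net; here’s a small console sketch (it assumes references to the Lucene.Net and Lucene.Net.Contrib.Snowball assemblies that ship with Sitecore, and that the 3.0.x token stream API shown matches your assembly versions):

using System;
using System.IO;
using Lucene.Net.Analysis.Snowball;
using Lucene.Net.Analysis.Tokenattributes;

class SnowballDemo
{
    static void Main()
    {
        // English Snowball stemmer, the same analyzer referenced in the Sitecore patch config
        var analyzer = new SnowballAnalyzer(Lucene.Net.Util.Version.LUCENE_30, "English");

        foreach (var word in new[] { "raining", "rained", "rain", "apples", "apple", "weakness", "weakly" })
        {
            var stream = analyzer.TokenStream("field", new StringReader(word));
            var term = stream.AddAttribute<ITermAttribute>();
            while (stream.IncrementToken())
            {
                // e.g. "raining -> rain", "apples -> appl"
                Console.WriteLine("{0} -> {1}", word, term.Term);
            }
        }
    }
}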

To put the SnowballAnalyzer to work for your Sitecore implementation, there are a couple approaches.  In this case, I created a patch config file to shift the sitecore_master_index to use Snowball instead.  The key element is the analyzer definition for the specific index (I would NOT suggest editing the defaultLuceneIndexConfiguration directly — I created my own contentSearch/indexConfigurations/snowballLuceneIndexConfiguration set to apply this more surgically).  For reference, this links to my full .config patch for the Sitecore 8.1 rev 151207 build; it’s a .doc since WordPress won’t allow a real .config.

Here is just the <analyzer> piece:

<analyzer type="Sitecore.ContentSearch.LuceneProvider.Analyzers.PerExecutionContextAnalyzer, Sitecore.ContentSearch.LuceneProvider">
    <param desc="defaultAnalyzer" type="Sitecore.ContentSearch.LuceneProvider.Analyzers.DefaultPerFieldAnalyzer, Sitecore.ContentSearch.LuceneProvider">
        <param desc="defaultAnalyzer"                      
        type="Lucene.Net.Analysis.Snowball.SnowballAnalyzer, Lucene.Net.Contrib.Snowball">
            <param hint="version">Lucene_30</param>
            <param hint="name">English</param>
        </param>
    </param>
    ...
</analyzer>

This is what the index configuration patch looks like.

<sitecore>
  <contentSearch>
    <configuration>
      <indexes>
        <index id="sitecore_master_index" type="Sitecore.ContentSearch.LuceneProvider.LuceneIndex, Sitecore.ContentSearch.LuceneProvider">
          <configuration>
            <patch:attribute name="ref">contentSearch/indexConfigurations/snowballLuceneIndexConfiguration</patch:attribute>
          </configuration>
        </index>
      </indexes>
    </configuration>
  </contentSearch>
</sitecore>

I have a lot more to say on this topic, but for now I’ll let this alone as this is sufficient for anyone to experiment with the SnowballAnalyzer for Sitecore.