The Game Is Afoot . . . Solr Shenanigans for Sitecore (Part 2)

Following on from Part 1 where I introduced what I’m up to here, let me jump right in to the other 5 Shenanigans for Solr + Sitecore:

Case 4 – The case of the default query crippler

  • In this scenario, a customer’s Solr was straining to the breaking point and we tracked it down to a set of circumstances where Sitecore was using a default value for ContentSearch.SearchMaxResults (defaults to “” which is int.MaxValue which is 2,147,483,647) and flooding Solr with essentially unbounded queries. That default is downright dangerous. The query logs showed the queries using a rows value of int.MaxValue in rapid succession:
  • INFO Solr Query – ?q=associated_docs:(“\*A6C71A21-47B5-156E-FBD1-B0E5EFED4D33\*”)&rows=2147483647&fq=_indexname:(domain_index_web)
    INFO Solr Query – ?q=((_fullpath:(\/sitecore/content/Branches/XYZ/*) AND _templates:(d0321826b8cd4f57ac05816471ba3ebc)))&rows=2147483647&fq=_indexname:(domain_index_web)
  • Solr will set aside some memory for the 2,147,483,647 results even if the dataset isn’t that large. I discussed a scenario like this in detail in this earlier post from 2018
  • This write-up on Solr and “Large Number of Rows” speaks exactly to this scenario: https://risdenk.github.io/2018/10/21/apache-solr-out-of-memory-symptoms-and-solutions.html

Case 5 – The case of the bandwidth blowout

  • Network bandwidth usage was off the charts for the customer we considered in this scenario. It took some digging, but we discovered it was due to a 23 GB Solr core being replicated across data centers. If one viewed the replication panel in the Solr UI, one could see the slow creep of the replication progress bar and it would never reach 100% complete before starting over.

shen5

    • There was additional supporting material such as Solr WARN messages etc:

shen6

    • The network latency was too much for Solr master/slave replication to complete it’s work, but Solr kept on trying to move that 23 GB Solr core across the planet . . . and since this was the sitecore_analytics_index it was kept very busy by Sitecore. It all made for a feedback loop of frequent updates to the analytics index that couldn’t properly synchronize between data centers.
    • For this particular scenario, we determined that there wasn’t a need to replicate the sitecore_analytics_index (it was consumed only by the CM environment which didn’t require the geographical scaling through Solr replication). We disabled the master/slave replication for that specific Solr core and the tidal wave of network traffic stopped. Case closed!

Case 6 – The case of the misguided, well-intentioned, administrator

  • This scenario introduces a server administrator into the equation, and they actually cause more harm than good. The “optimize now” button in the Solr UI lured this administrator into clicking it without understanding the consequences:

OptimizeNow.JPG

  • I posted about this in detail last year, so I won’t dig too far into it here, but the gist of this scenario illustrates how Solr internally organizes files and that there are questionable UI choices for that Optimize Now button. It makes it look like an easy way for one to improve Solr performance when — in reality — clicking that Optimize Now can be pretty expensive in terms of perf, especially for a volatile Solr core.

Case 7 – The case of the AppPool recycle-fest

  • This scenario is one from a couple years ago, but it’s still relevant as a cautionary tale. For a long time, if Sitecore lost an active connection to Solr, the only option was to recycle the Sitecore AppPool. For a Solr server restart, or service restart, or even a transient network failure . . . Sitecore would need to run through the application initialization logic to reacquire Solr connectivity. In this specific case, there was a recurring network issue that interfered with Sitecore’s connectivity to Solr, so the customer scheduled IIS AppPool recycles every 15 minutes to ensure a fresh connection to Solr was available. This AppPool recycle-fest has terrible consequences for website performance as the site is constantly spending time on recycles and the related pipeline of events.
  • This case highlights why there are now more elegant ways of handling this; I recently blogged about the IsSolrAliveAgent designed to solve this exact problem. There’s periodic logic to reconnect Sitecore with Solr now, and it’s important to appreciate why it’s there and — probably — why you may want to tune the default setting of every 10 minutes for your production environment.

That’s the 7 Shenanigans related to Sitecore and Solr from my talk earlier in October. It’s a fun paradigm for learning more about the overlaps of Sitecore and Solr and I hope it helps others to get more from their Sitecore + Solr technology stack!

The Game Is Afoot . . . Solr Shenanigans for Sitecore (Part 1)

I took the challenge of presenting at the Manchester, New Hampshire Sitecore User Group a few days ahead of the 2018 Sitecore Symposium. I say challenge because

  1. Delivering content that isn’t superseded by the Sitecore Symposium agenda can be difficult (all the good Commerce or Azure or DevOps material would be saved for Symposium week)
  2. I would be following Michael West and his showcase of the new Sitecore PowerShell Extensions version 5.0 module (curious that https://doc.sitecorepowershell.com/releases doesn’t list the 5.0 yet, but I’m sure it’s coming there too).

As I finalized my topic, I surveyed the work I’ve been up to recently and figured I could take my talk in one of two directions that would be generally absent from the Sitecore Symposium agenda: Sitecore and Solr, or The Heresy of Sitecore on AWS. While we are doing some interesting things around AWS with RDS, ElastiCache, that’s officially unsupported territory with Sitecore and not fully baked enough for me to present it as anything approaching a best practice — but check back with me in 6 months. So, I elected to give a talk on Solr and explore some of the lessons learned from years in the trenches making Sitecore successful with Solr; the topic was finalized as Solr, Sitecore 9, 7 Shenanigans:

shenanigans
The title slide from the talk

I covered some of the history and underpinnings of Solr with regards to Sitecore, the dependence on Solr.Net (which is <important>NOT</important> a port of Solr to .Net the way Lucene.Net IS a port of Java’s Lucene — and why we should care), and common architecture patterns for Sitecore integrations based on Solr master/slave and Solr Cloud. I guess I’ve blogged a lot about Solr over the years; for instance, here and here are a couple sample areas I delved into.

I think the most fun part of the talk was the Shenanigans, however, as I went with a Sherlock Holmes theme to frame the conversation. I reviewed 7 cases and we had fun digging into some of the diagnostic bits.

shenanigans2

Here’s a quick run down of the first 3 Shenanigans:

  1. The case of the disappearing Java
    • Where we started with Solr that wouldn’t start for a set of Production servers  . . .

shen1.JPG

    • . . . and eventually solved the case by determining the development team installed a Java SDK with auto-update enabled and the system had removed the Java identified in the ClassPath. This is a brutal one for a Production implementation!
  1. The case of the underachieving mega-server
    • These OutOfMemoryErrors are not fun and can be due to a variety of issues:
      • shen2.JPG
    • In this case, we determined this 32 GB server was running Solr with the default 512 MB of memory set aside for Solr . . . a pretty fundamental issue:3.512 MB - default of 32 GB
    • We tuned the Solr start settings to use more of the server capacity for Java and Solr, in this case I think we used 10 GB as a starting point, and solved this specific case. This isn’t the last we’ll hear about OutOfMemory errors in our Shenanigans, however, as there can be many causes (see this great summary published just yesterday, for example).
  2. The case of the disappearing content
    • This scenario had a public-facing website’s content disappear periodically after content publishes . . . sound familiar to anyone? It was due to the threshold for full rebuilding of the search indexes after a Sitecore publish set low enough to trigger regularly
      • For the record, the setting is ContentSearch.FullRebuildItemCountThreshold and the default is 100,000 (0x186a0 – 100,000):Shen3

    • This customer didn’t have SwitchOnRebuild implemented for these key public-facing search indexes, so the first step of the index rebuild logic was to remove all documents from the Solr collection, then add the documents back in as the indexing process ran it’s course.  To the site visitor it created missing or inconsistent search results while the rebuilding took place, and for a large set of items it can take 60 minutes or longer for rebuilding.
    • The solution is to use the SwitchOnRebuild implementation for their Sitecore search indexes – https://doc.sitecore.net/sitecore_experience_platform/setting_up_and_maintaining/search_and_indexing/indexing/switch_solr_indexes and related documentation from Sitecore covers this process.

I will cover the remaining four Solr + Sitecore Shenanigans in my next post; here’s a teaser for the topics:

  • Case #4 – The case of the default query crippler
  • Case #5 – The case of the bandwidth blowout
  • Case #6 – The case of the misguided, well-intentioned, administrator
  • Case #7 – The case of the AppPool recycle-fest

As Sherlock Holmes would say: “The game is afoot!”

The tale of the IsSolrAliveAgent for Sitecore

I had the pleasure of assessing a Sitecore 7 implementation the other day and it made me a bit nostalgic for the good old days — OMS had grown into DMS, all nicely contained in a tidy SQL Server database; barely a whisper of Solr; ItemBuckets were the new kid on the block.

It got me thinking about how some features can evolve over time until the obscure becomes mainstream. One example of this feature evolution, that for some reason has relevance to a variety of customers we’re working with right now, is in how Sitecore maintains connections with Solr (or should I say how Sitecore doesn’t maintain connections?).

Many of us have learned hard lessons that if you have a Solr server and it reboots or experiences an interruption in service, it can mean downtime for your Sitecore site unless you take special measures to guard against it. Sitecore’s initialization process connects to Solr and then holds that connection for the lifetime of the IIS AppPool. If that connection fails for any reason, there wasn’t logic in the Sitecore Solr Provider for a graceful reconnect. At least, that was the case for several iterations of Sitecore’s standard Solr search integration.

A few years ago, a basic approach evolved through the Sitecore developer community to correct the above limitation with Sitecore; it may have come from Sitecore support, but it wasn’t publicized in any way. One could explicitly repeat Sitecore’s initialization logic for Solr in a custom agent, scheduled to run periodically. This was a custom solution and I saw a few iterations of it — mostly a brute force approach. But it worked.

In Sitecore 8.2 update-1,  buried in the Release Notes is this fragment relevant to the story:

Solr

The Sitecore.ContentSearch.SolrProvider, from that point onward, contained a new agent defined in Sitecore.ContentSearch.Solr.DefaultIndexConfiguration.config named IsSolrAliveAgent that would serve as a retryer for Solr if the connectivity was lost.  It was configured to run every 10 minutes for a default implementation.

By the way, 10 minutes is probably too long an interval in my experience — even 1 minute can be too long for a production environment to wait before trying to reconnect a key component such as Solr. Also, if you set the agent to 1 minute, but have a /sitecore/scheduling/frequency value defined as something like 10 minutes, you need to change the frequency value to ensure the IsSolrAliveAgent is executed on the schedule you expect.

What had been a little-known approach to keeping Sitecore connected to Solr across interruptions had made the big time: it was now part of the official Sitecore search provider code!

This first implementation included in the Sitecore code base wasn’t perfect, though, and it was improved upon in subsequent releases of Sitecore. Improvements include better logging . . . more efficient iteration through the search indexes . . . but the general approach remains the same.

At this point, there’s an official patch for the IsSolrAliveAgent that Sitecore makes available at https://github.com/SitecoreSupport/Sitecore.Support.163850.171950/releases/tag/8.2.6.0 and please note the compatibility caveats on that repo. We do have a Sitecore 8.2 update-5 customer making use of the patch without issue, even though it’s not officially listed as compatible for that version — but that could be exceptional in our case, so always perform your own evaluations, tests, etc. This patch addresses a known issue where, if Solr is unavailable during Sitecore initialization, the SwitchOnRebuildSolrCloudSearchIndex indexes are not properly initialized.

It’s interesting — at least to a Sitecore/Solr nerd like me — to decompile and compare all the changes over time to this component; there’s logging and tests and generally better code present in the newest iteration of this IsSolrAliveAgent vs the earlier implementations. Some of the changes are subtle, but the evolution of this IsSolrAliveAgent from the days of “hey, we’ve got this homegrown ten lines of code that reconnects Sitecore to Solr if Solr goes down” is remarkable.

In some ways, it parallels the progress of Sitecore as an entire platform. The CMS became a “personalization platform” which now is adding a Commerce ecosystem. OMS became DMS which became xDB. On we go.

At this point, adding a patch .config file such as the following to your Sitecore project is the state of the art in Sitecore and Solr connectivity . . . but it surely will be improved upon over time:

<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/" xmlns:set="http://www.sitecore.net/xmlconfig/set/">
  <sitecore>
    <scheduling>
      <agent type="Sitecore.ContentSearch.SolrProvider.Agents.IsSolrAliveAgent, Sitecore.ContentSearch.SolrProvider">
        <patch:attribute name="type">Sitecore.Support.ContentSearch.SolrProvider.Agents.IsSolrAliveAgent, Sitecore.Support.163850.171950</patch:attribute>
        <patch:attribute name="interval">00:01:00</patch:attribute>
      </agent>
    </scheduling>
  </sitecore>
</configuration>

 

Sitecore’s SessionDictionaryData class saves the day

More and more projects are using Azure SQL as the database back-end for Sitecore (so long as they’re running Sitecore 8.2 and newer — if alignment to Sitecore official support guidance is important to you). This sets up a new class of performance considerations around Azure SQL, and I want to share one tuning option we learned while investigating high DTU usage for the Sitecore xDB “ReferenceData” database in a Sitecore 9 PaaS build. We wanted to off-load some of the work this “ReferenceData” database was doing, and investigations into which Azure SQL queries were causing the DTU spikes pointed to INNER JOINs between the ReferenceData.DefinitionMonikers and ReferenceData.Definitions tables.

Sitecore support pointed us in the right direction at this juncture, since the default DictionaryData was using AzureSQL for persistence — we should consider a store more suited to rapid key/value access. If this sounds like a job for Redis, you’d be correct, and fortunately Sitecore has an implementation that’s suited for this type of dictionary access in the Sitecore.Analytics.Data.Dictionaries.DictionaryData.Session.SessionDictionaryData class.

The standard Sitecore pipeline we’re talking about is the getDictionaryDataStorage pipeline and it’s used by Sitecore Analytics to store Device, UserAgent, and other key/value pair lookups. Here’s it’s definition:

The alternative we moved to is to use session state for storing that rapidly requested data,  so we updated the DictionaryData node to instead use the class Sitecore.Analytics.Data.Dictionaries.DictionaryData.Session.SessionDictionaryData. For this Azure PaaS solution, it amounts to using Azure Redis for this work since that’s where the session state is managed. Here’s the new definition:

What this boils down to is the implementation in Sitecore.Analytics.DataAccess.dll of Sitecore.Analytics.DataAccess.Dictionaries.DataStorage.ReferenceDataClientDictionary was shown to be a performance bottleneck for this particular project, so changing to use the Sitecore.Analytics.dll with it’s Sitecore.Analytics.Data.Dictionaries.DictionaryData.Session.SessionDictionaryData aligns the project to a better-fit persistence mechanism.

We considered if we could improve upon this progress by extending the SessionDictionaryData class to be IIS in-memory regardless of the Sitecore session-state configuration; there would be no machine boundary to cross to resolve the (apparently) volatile data. Site visitors would require affinity to a specific AppService host in Azure, though, with this and it’s possible – or even likely — that Sitecore assumes this is shared state across an entire implementation. We talked ourselves out of seriously considering a pure IIS in-memory solution.

I think it’s possible we could improve the performance with the default ReferenceDataClientDictionary by tuning any caches around this analytics data, but I didn’t look into that since time was of the essence for this investigation and the SessionDictionaryData class looked like such a quick win. I may revisit that in the next iteration, however, depending on how this new solution performs over the long term.

Updating Sitecore 9 to connect to Solr over HTTP instead of HTTPS

It’s swimming against the tide to use HTTP instead of HTTPS with Sitecore and Solr, but there were a series of circumstances that had us reviewing the SSL default and evaluating other options. I had to go to Sitecore support for clarification on setting up Solr over HTTP instead of HTTPS (now that HTTPS is the default in the Sitecore 9 space). There wasn’t succinct documentation from Sitecore, so let me share the findings here in case it helps others . . .

For the standard Sitecore CM and CD roles, one can update the  ContentSearch.Solr.ServiceBaseAddress URL — just use http instead of https.

For the Sitecore xConnect role, one needs to change the following configuration:

1) For the Collection instance, update the App_data\config\sitecore\CollectionSearch\sc.Xdb.Collection.IndexReader.SOLR.xml file:

<Options>
<ConnectionStringName>solrCore</ConnectionStringName>
<RequireHttps>false</RequireHttps>
</Options>

2) For the xConnect Indexer instance, update the App_data\jobs\continuous\IndexWorker\App_data\Config\Sitecore\CollectionSearch\sc.Xdb.Collection.IndexReader.SOLR.xml file:

<Options>
<ConnectionStringName>solrCore</ConnectionStringName>
<RequireHttps>false</RequireHttps>
</Options>

3) Also for the xConnect Indexer instance, update the App_data\jobs\continuous\IndexWorker\App_data\Config\Sitecore\SearchIndexer\sc.Xdb.Collection.IndexWriter.SOLR.xml file:

<Options>
<ConnectionStringName>solrCore</ConnectionStringName>
<RequireHttps>false</RequireHttps>
<Encoding>utf-8</Encoding>
</Options>

<Options>
<ConnectionStringName>solrCore</ConnectionStringName>
<RequireHttps>false</RequireHttps>
<MaximumUpdateBatchSize>1000</MaximumUpdateBatchSize>
<MaximumDeleteBatchSize>1000</MaximumDeleteBatchSize>
<MaximumCommitMilliseconds>1000</MaximumCommitMilliseconds>
<ParallelizationDegree>4</ParallelizationDegree>
<MaximumRetryDelayMilliseconds>5000</MaximumRetryDelayMilliseconds>
<RetryCount>5</RetryCount>
<Encoding>utf-8</Encoding>
</Options>

Curious case of the LocationsDictionaryCache

We have some cache monitoring in place for some enterprise Sitecore customers and that system has found one particular cache being evicted roughly 40 times per hour for one particular customer. It’s curious, as this cache isn’t covered in the standard documentation for Sitecore’s caches. Sitecore support had to point me in the right direction on this . . . and it’s an interesting case.

The LocationsDictionaryCache cache is defined in Sitecore.Analytics.Data.Dictionaries.LocationsDictionary and there’s a hard coded 600 second expiration in that object definition:

Expires

There’s also a maxCacheSize set of 0xf4240 hex (1,000,000 or 976 KB). You can’t alter these settings through configuration, you’d have to compile a new DLL to alter these values.

It’s not clear to me that the quick eviction/turn-over of this cache is a perf issue to worry about . . . I think, at this point, it’s working as expected and indicative of a busy site with lots of Sitecore Analytics and Tracking behaviour. Reviewing the decompiled Sitecore code that uses this class (like in Sitecore.Analytics.Tracking.CurrentVisitContext or Sitecore.Analytics.Pipelines.CommitSession.ProcessSubscriptions), it appears that this cache serves as a short-term lookup along the same lines as devices, user-agents, etc. Why this particular cache is active in ways UserAgentDictionaryCache and others is not, besides the obvious 600 second life span, is something we need to dig further into — but I don’t know that it’s a perf bottleneck for our given scenario.

 

Which Solr Node Is Responding to My Sitecore Query?

In working with Sitecore + Solr, eventually you may need to determine which Solr server responded to a specific query in order to validate Solr replication or to compare Solr responses across machines. If you can’t imagine a reason why you’d want to do this, you’re probably fortunate and should maybe go buy a lottery ticket instead of reading this blog post 🙂

Brute Force Method with Solr Master/Slave

If you’re working with Solr master/slave behind a load-balancer, or with multiple slaves behind a load-balancer, I haven’t found a reliable way of determining which Solr server responded to a particular query besides the brute force method of comparing Sitecore and Solr logs. Specifically, from the Sitecore logs in Data\logs\Search.Logs.[Timestamp].txt you should see something like the following for each query:

Sitecore’s Query Log

15:11:48 INFO Serialized Query – ?q=(_template:(f613d8a8d9324b5f84516424f49c9102) AND (-_name:(“__Standard Values”) AND _language:(en-US)))&start=0&rows=1&fl=*,score&fq=_indexname:(sitecore_web_index)&sort=searchdate_tdt desc

Solr’s Query Log

You can cross-reference this Sitecore log data with Solr logs (like S:\solr-4.10.4\sitecore\logs\solr.log):

INFO – 2018-08-01 15:11:48.514; org.apache.solr.core.SolrCore; [sitecore_web_index] webapp=/solr path=/select params={q=(_template:(f613d8a8d9324b5f84516424f49c9102)+AND+(-_name:(“__Standard+Values”)+AND+_language:(en-US)))&fl=*,score&start=0&fq=_indexname:(sitecore_web_index)&sort=searchdate_tdt+desc&rows=1&version=2.2} hits=2958 status=0 QTime=16

It’s tedious to match up the exact query and time in these logs, but the Solr node with the matching record will reveal which one serviced the request.

Now, one can craft some PowerShell to scrounge the Sitecore and Solr logs and determine where we have matches — crazy as it may sound — but I’m not interested in sharing all that here. It’s academic to read the two logs and look for matches, anyway, so I’ll move on to the Solr Cloud scenario that is more interesting and forward-looking since Sitecore is steadily progressing towards full Solr Cloud support across the board.

Solr Cloud Debug Query Command

Solr Cloud supports a debug command where you append debug=true to the URL and Solr includes diagnostic output in the results. For example, a RESTful query to Solr like http://10.215.118.28:8983/solr/sitecore_web_index_shard1_replica2/select?q=_name%3ANEWS&wt=xml&indent=true&amp;debug=true. Using the XML formatting, debug=true adds something like this to the response from Solr:

Capture

There can be interesting tidbits in each of those debug sections, but I’m going to focus on the track node that shares information about the different phases of the distributed request Solr is making. Under the “EXECUTE_QUERY” item is a “name” attribute that will specify which Solr nodes, shards, and replica were involved in responding to the query, for example:

<lst name=”http://10.215.140.12:8983/solr/sitecore_web_index_shard2_replica1/|http://10.215.140.13:8983/solr/sitecore_web_index_shard2_replica2/”>

I’ve also found the “shard.url” value of the Response (nested under the EXECUTE_QUERY data) to share the same information. It’s possible that’s more reliable across Solr versions etc, but something to keep an eye on. Here’s a fragment of the XML response for the debug information:

Capture

A careful reader might point out that the “rid”  value includes the IP address of the server responding to the request, but this is designed to be the “request ID” that traces the query through Solr’s various moving parts — I wouldn’t rely on the “rid” to tell you the source for the response, though, as it could be changing across versions.

Here’s a quick run through of the other diagnostic data in that EXECUTE_QUERY data that I know about:

  1. QTime: the number of milliseconds it took Solr to execute a search with no regard for time spent sending a response across the network etc
  2. ElapsedTime: the number of milliseconds from when the query was sent
    to Solr until the time the client gets a response. This includes QTime, assembling the response, transmission time, etc.
  3. NumFound: the count of results

There is a ton to all this and we’re only scratching the surface, but as Sitecore gets more serious about scalable search with Solr, we’re all going to be learning a lot more about this in the months and years to come!

SitecoreSupport GitHub is Back!

It looks like Sitecore has returned public visibility to https://github.com/SitecoreSupport

This is a GitHub repo that Sitecore’s official support team uses to document, track, and communicate the various patches they make available to customers. Public access was removed about a year ago, and some of us in the community really felt the loss (and wished we had downloaded the entire repo to use as a reference!).

For me, that repo is a wealth of information on how to tackle specific Sitecore challenges. It’s instructive to see the strategies involved with addressing a Solr integration defect, just for example, and can be applicable to a lot of different customer solutions.

Capture
Screen shot of a hotfix from the repo

A lot of my time the last few months has been focused on Solr integrations and connecting different Sitecore systems together, so having examples like https://github.com/sitecoresupport/Sitecore.Support.198901/releases or https://github.com/SitecoreSupport/Sitecore.Support.449298 can be like a master course in how the internals of Sitecore fit together.

I’m not suggesting you take the code on that repo and blindly apply it to your Sitecore implementation, but as a learning reference and source of how Sitecore’s internals behave, it’s a vital resource to those advanced Sitecore professionals.

Sitecore and SearchMaxResults for Solr

I’ve consulted with a number of Sitecore implementations in the last month or two that had a variety of challenges with Sitecore integration with Solr. Solr is a powerful distributed system with a variety of angles of interest to Sitecore. I wanted to take this chance to comment on a Sitecore setting that can have a significant impact on how Sitecore search functions, but is easily overlooked. The setting is defined in Sitecore.ContentSearch.config and it’s called ContentSearch.SearchMaxResults. The XML comment for this setting is straight-forward, here’s how it’s presented in that file:

snip

There’s a lot to digest in that xml comment above. One could read it and assume “this can be set but it is best kept as the default” means this shouldn’t be altered, but in my experience that can be problematic.

The .Net int.MaxValue constant is 2,147,483,647. If you leave this setting at the default (so “”), one is telling Solr to return up to 2,147,483,647 results in a single response to a query, which we’ve observed in some cases to cause significant performance problems (Solr will fetch the large volume of records from disk and write them to the Solr response causing IO pressure etc.) It’s not always the case since it really comes down to the number of documents one is querying from Solr, but this sets up the potential for a virtually unbounded Solr query.

It’s interesting to trace this setting through Sitecore and into Solr, and it sheds light on how these two complex systems work together. Fun, right!? I cooked up the diagram below that shows an overview of how Sitecore and Solr work together in completing search queries:

snipp

Each application has it’s own logging which will help trace activity between the systems.

The Sitecore ContentSearch Provider for Solr relies on Solr.Net for connectivity to Solr. It’s common for .Net libraries to copy their open source equivalents from the Java world (like Log4J has a .Net port for logging named Log4net, Lucene has a .Net port for search called Lucene.Net, etc). Solr.Net, however, is not a port of the Solr Java application to .Net. Instead, Solr.Net is a wrapper for the main Solr API elements that can be easily called by .Net applications. When it comes to Sitecore’s ContentSearch Provider for Solr, Solr.Net is Sitecore’s bridge for getting data to and from the Solr application.

Just an aside: some projects do creative things with Solr search and Sitecore, and for certain scenarios it’s necessary to bypass Solr.Net and use the REST API directly from Sitecore. This write-up focuses on a conventional Sitecore -> Solr.Net -> Solr pattern, but I wanted to acknowledge that it’s not the only pattern.

Tracking ContentSearch.SearchMaxResults in Sitecore

On the Sitecore side, one can see the ContentSearch.SearchMaxResults setting in the Sitecore logs when you turn up the diagnostics to get more granular data; this isn’t a configuration that’s recommended for using beyond a discrete troubleshooting session as the amount of data it can generate can be significant . . . but here’s how you dial up the diagnostic data Sitecore reports about queries:

snip3

If you run a few queries that exercise your Sitecore implementation code that queries Solr, you can review the contents of the Search log in the Sitecore /data directory and find entries such as:

INFO Solr Query – ?q=associated_docs:(“\*A5C71A21-47B5-156E-FBD1-B0E5EFED4D33\*”)&rows=2147483647&fq=_indexname:(domain_index_web)

or

INFO  Solr Query – ?q=((_fullpath:(\/sitecore/content/Branches/ABC/*) AND _templates:(d0351826b1cd4f57ac05816471ba3ebc)))&rows=2147483647&fq=_indexname:(domain_index_web)

The .Net int.MaxValue 2147483647 is what Sitecore, through Solr.Net, is specifying as the number of rows to return from this query. For Solr cores with only a few hundred results matching this query, it’s not that big a deal because the query has a fairly small universe to process and retrieve. If you have 100,000 documents, however, that’s a very heavy query for Solr to respond to and it will probably impact the performance of your Sitecore implementation.

Tracking ContentSearch.SearchMaxResults in Solr

Solr has it’s own logging systems and this 2147483647 value can be seen in these logs once Solr has completed the API call. In a default Solr installation, the logs will be located at server/logs (check your log4j.properties file if you don’t know for sure where your logs are being stored) and you can a open up the log and scan for our ContentSearch.SearchMaxResults setting value. You’ll see entries such as:

INFO  – 2018-03-26 21:20:19.624; org.apache.solr.core.SolrCore; [domain_index_web] webapp=/solr path=/select params={q=(_template:(a6f3979d03df4441904309e4d281c11b)+AND+_path:(1f6ce22fa51943f5b6c20be96502e6a7))&fl=*,score&fq=_indexname:(domain_index_web)&rows=2147483647&version=2.2} hits=2681 status=0 QTime=88

  • The above Solr query returned 2,681 results (hits) and the QTime (time elapsed between the arrival of the query request to Solr and the completion of the request handler) was 88 milliseconds. This is probably no big deal as it relates to the ContentSearch.SearchMaxResults, but you don’t know if this data will increase over time…

INFO  – 2018-03-26 21:20:19.703; org.apache.solr.core.SolrCore; [domain_index_web] webapp=/solr path=/select params={q=((((include_nav_xml_b:(True)+AND+_path:(00ada316e3e4498797916f411bc283cf)))+AND+url_s:[*+TO+*])+AND+(_language:(no-NO)+OR+_language:(en)))&fl=*,score&fq=_indexname:( domain_index_web)&rows=2147483647&version=2.2} hits=9 status=0 QTime=16

  • The above Solr query returned 9 results (hits) and the QTime was 16 milliseconds. This is unlikely a problem when it comes to ContentSearch.SearchMaxResults.

 INFO  – 2018-03-26 21:20:19.812; org.apache.solr.core.SolrCore; [domain_index_web] webapp=/solr path=/select params={q=(_template:(8ed95d51A5f64ae1927622f3209a661f)+AND+regionids_sm:(33ada316e3e4498799916f411bc283cf))&fl=*,score&fq=_indexname:(domain_index_web)&rows=2147483647&version=2.2} hits=89372 status=0 QTime=1600

  • The above Solr query returned 89,372 results (hits) and the QTime was 1600 milliseconds. Look out. This is the type of query that could easily cause problems based on the Sitecore ContentSearch.SearchMaxResults setting as the volume of data Solr is working with is getting large. That query time is already climbing high and that’s a lot of results for Sitecore to require in a single request.

The impact of retrieving so many documents from Solr can cause a cascade of difficulties besides just the handling of the query. Solr caches the query results in memory and if you request 1,000,000 documents you could also be caching 1,000,000 million documents. Too much of this activity and it can stress Solr to the breaking point.

Best Practices

There is no magic value to set for ContentSearch.SearchMaxResults other than not “”. General best practice when retrieving lots of data from most any system is to use paging. It’s recommended to do that for Sitecore ContentSearch queries, too. A general recommendation would be to set a specific value for the ContentSearch.SearchMaxResults setting, such as “500” or “1000”. This should be thoroughly tested, however, as limiting the search results for an implementation that isn’t properly using search result paging can lead to inconsistent behavior across the site. Areas such as site maps, general site search, and other areas with implementation logic that could assume all the search results are available in a single request deserve special attention when tuning this setting.

What About Noisy Solr Neighbors?

I’ve worked on some implementations where Solr was a resource shared between a variety of Sitecore implementations. One project, in this example, might set ContentSearch.SearchMaxResults to “2000” for their Sitecore sites while another project sets a value of “500” – but what if there’s a third organization making use of the same Solr resources and that project doesn’t explicitly set a value for ContentSearch.SearchMaxResults? That one project leaves the setting at the “” default, so it uses the .Net int.MaxValue. This is a classic noisy neighbor problem where shared services become a point of vulnerability to all the consuming applications. The one project with “” for ContentSearch.SearchMaxResults could be responsible for dragging Solr performance down across all the sites.

Solr is an extensible platform much like Sitecore, and in some ways even more so. In Sitecore one extends pipelines or overrides events to achieve the customizations you desire; the same general idea can be applied to Solr – you just use Java with Solr instead of C#.

In this case, our concern being unbounded Solr queries, we can use an extension to a search component (org.apache.solr.handler.component.SearchComponent) to introduce our custom logic into the Solr query processing. In our case, we want to enforce limits to the number of rows a query can request. This would be a safety net in case an un-tuned Sitecore implementation left a ContentSearch.SearchMaxResults setting at the default.

Some care must be taken in how this is introduced into the Solr execution context and where exactly the custom logic is handled. This topic is all covered very well, along with sample code etc, at https://jorgelbg.wordpress.com/2015/03/12/adding-some-safeguard-measures-to-solr-with-a-searchcomponent/. For an enterprise Solr implementation with a variety of Sitecore consumers, a safety measure such as this could be vital to the general stability and perf of the platform – especially if one doesn’t have control over the specific Sitecore projects and their use (or abuse!) of the ContentSearch.SearchMaxResults setting. Maybe file this under best practices for governing Sitecore implementations with shared Solr infrastructure.

Sitecore Commerce 8.2.1 MSCS_Admin Database & the Tyranny of SQLOLEDB.1

Disclaimer: this is a journey of discovery and not a manual on database best practices for Sitecore Commerce 8.2.1.

You don’t see a lot of talk about disaster recovery and Sitecore Commerce — getting the regular implementation to work well is enough of a challenge that DR and Sitecore Commerce feels like a mythical Phase 5 of a project with only 4 Phases.

I want to write-up some notes and exploratory digging I did recently on the topic of disaster recover and Sitecore Commerce 8.2.1. You could even consider it a poor man’s option for database high availability that doesn’t incur the licensing costs of SQL Server Always On . . . but it’s probably best if you forget I ever mentioned that. Just because something can be done, doesn’t mean it should be done. I get in trouble sometimes presenting cans when I should just stick to the shoulds, but sometimes there’s real opportunity in those cans so I just can’t resist. Reader: beware.

Enough preamble. To simplify the scope of this post, I’m going to focus on database disaster recovery since it’s distinct for Sitecore Commerce 8 from “regular” Sitecore without Commerce. It’s because Commerce has a decade (or two!?) of COM code and legacy architectures that are way down deep in the Sitecore Commerce 8.2.1 system.

The crux of the challenge I was tasked with addressing was the inability to identify a SQL Mirror as part of a Sitecore Commerce 8.2.1 project. Many customers have used SQL Server database “Mirroring” as the high availability option for Sitecore databases for a long time because it was the only one officially supported by Sitecore. As this documentation explains, only since Sitecore 8.2 has “Always On” been an option for an officially supported Sitecore implementation. I know — many projects are successful with newer SQL Server approaches or RDS on AWS etc, but in my role at Rackspace, we have to walk the line of what Sitecore officially supports from top to bottom to ensure clean lines of escalation in the event of any issue; this is appealing to risk-averse customers, those with aggressive SLAs, etc.

To use SQL Server Mirroring one must identify a failover partner in each of your SQL Server database connections defined in ConnectionStrings.config like:

FailoverPartner

In Sitecore Commerce 8.2.1, however, this is the interface you have to manage a database connection for the MSCS_Admin database . . . there is a single endpoint (server name) and no provision for a failover partner. It comes down to a limit of the SQLOLEDB.1 provider, I believe. It’s fine for a SQL Server Availability Group listener where you get a service to route requests between the Always On SQL Server nodes, but this Commerce UI is incompatible with SQL Server Database Mirroring:

DRCommerce

I set to digging and found some old documentation on a Windows Registry key for Commerce Server 2007 edition and the MSCS_Admin database. I’ll assume you understand the primacy of the MSCS_Admin database for Sitecore Commerce 8.2.1 — if that’s not the case, you can review this material for background, but trust when I explain that MSCS_Admin is the administrative heart of Sitecore Commerce 8.2.1. This SQL query shows how the ResourceProps table in MSCS_Admin stores all the dependent database connections for Inventory, Catalog, Profiles, etc:

Now, the old documentation I found mentions an encrypted registry key named ADMINDBPS that is where Commerce Server Manager, the desktop tool, reads and writes the actual database connection string for the MSCS_Admin database. Since I can’t insert a DB Mirroring Failover Partner into the desktop tool, I figured I could engineer a work around using this registry key as leverage.

The problem, however, is this documentation I was reading was from 2006 and no longer reflected reality. It also mentioned how this approach wasn’t supported by Microsoft and came with every cautionary disclaimer. Sounds like fun, right? The Windows Registry Key schema had changed, but not the overall approach and after doing some digging I found HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\CommerceServer was where this newer version of Sitecore Commerce had reorganized the Registry state and that’s where ADMINDBPS was hiding!

admindbps

The connection string to MSCS_Admin was still encrypted . . . sure . . . but nothing the .Net System.Security.Cryptography namespace couldn’t resolve.

The connection string to MSCS_Admin was still buried in the Windows Registry . . . sure . . . but nothing the .Net Microsoft.Win32 namespace couldn’t resolve.

Here’s the code I used to read this value:

RegWork

Since we’re getting this value, we might as well just add the set logic too, right?

RegWork2

Yes, let the Registry edits flow through you!

I told you this was a reader: beware type of post.

Taking a step back, remember my original challenge was finding a way to weave support for SQL Server DB Mirroring into Sitecore Commerce 8.2.1 and I found the MSCS_Admin database connection to be my focus? We can wrap the above c# code into a command line tool that can be run in a database failover situation to update the MSCS_Admin database connection automatically — changing to the intended failover partner defined as part of the conventional SQL Server Database Mirroring. We could schedule this c# logic to run once every 30 seconds in Windows Task Scheduler, for example, conditionally changing the MSCS_Admin database connection in the event of a failover scenario. The MSCS_Admin database connection string could update automatically. I’m not going to present this as a viable alternative to SQL Server Always On for many many reasons, but you achieve some semblance of it with robust enough PowerShell.

If you’re scratching your head here because I work with customers who are risk averse or have critical Sitecore SLAs that would still rely on the above Registry hacks, you’re not quite getting my point. I’m sharing this because it’s possible, it’s interesting to know how the guts of Sitecore Commerce 8.2.1 are working, and others may also learn from this. While it’s theoretically an option to introduce some automated failover logic through this method, it’s an uncomfortable hack. Based on my testing, though, it works.

Honestly, this technique is most appropriate in a disaster recovery scenario where a set of database servers are unavailable (and then this can apply whether using SQL Server  Always On or DB Mirroring — doesn’t really matter). Instead of considering this a high availability solution, it’s a DR solution.

I spoke with some on the Sitecore Commerce technical team about this and they agreed this was a bit crazy, but it works (and also I shouldn’t quote them directly). They also pointed out how you don’t have to store MSCS_Admin connections in the Windows Registry and that as part of the PaaS support evolving through Sitecore Commerce there was a “Registry Free” deployment option for Sitecore Commerce 8.2.1 that I didn’t know about for this version. With this technique, you can use a ConnectionString defined like the others for Sitecore (see some details here)

<connectionStrings>
 <add name="ADMINDBPS" connectionString="<your MSCS_Admin connection string" />
</connectionStrings>

I haven’t experimented with the Registry Free deployment for Commerce 8.2.1 but I’d like to see if it avoids the tyranny of the SQLOLEDB.1 provider and would let us add the Failover Partner logic to a connection string. I think this blog post may have a Part 2 . . . but I’m not sure how much further down this rabbit hole it’s worth tunneling.