The Game Is Afoot . . . Solr Shenanigans for Sitecore (Part 2)

Following on from Part 1, where I introduced what I’m up to here, let me jump right into the remaining four Shenanigans for Solr + Sitecore:

Case 4 – The case of the default query crippler

  • In this scenario, a customer’s Solr was straining to the breaking point, and we tracked it down to Sitecore relying on the default value for ContentSearch.SearchMaxResults (an empty string, which Sitecore interprets as int.MaxValue, or 2,147,483,647) and flooding Solr with essentially unbounded queries. That default is downright dangerous; a config patch to bound it follows this list. The query logs showed query after query using a rows value of int.MaxValue in rapid succession:
  • INFO Solr Query – ?q=associated_docs:("\*A6C71A21-47B5-156E-FBD1-B0E5EFED4D33\*")&rows=2147483647&fq=_indexname:(domain_index_web)
    INFO Solr Query – ?q=((_fullpath:(\/sitecore/content/Branches/XYZ/*) AND _templates:(d0321826b8cd4f57ac05816471ba3ebc)))&rows=2147483647&fq=_indexname:(domain_index_web)
  • Solr will set aside memory for the 2,147,483,647 requested results even if the dataset isn’t anywhere near that large. I discussed a scenario like this in detail in this earlier post from 2018.
  • This write-up on Solr and “Large Number of Rows” speaks exactly to this scenario: https://risdenk.github.io/2018/10/21/apache-solr-out-of-memory-symptoms-and-solutions.html
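
How do you guard against this? Bound the fallback. Below is a minimal sketch of a Sitecore include patch; the value of 500 is an arbitrary illustration (size it to your site’s real needs), and as I understand it this setting only governs queries that don’t specify their own Take()/rows value:

<!-- App_Config/Include/zzzz/Bound.SearchMaxResults.config (sketch) -->
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <settings>
      <!-- An empty value means int.MaxValue; cap it so unbounded queries can't flood Solr -->
      <setting name="ContentSearch.SearchMaxResults">
        <patch:attribute name="value">500</patch:attribute>
      </setting>
    </settings>
  </sitecore>
</configuration>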

Case 5 – The case of the bandwidth blowout

  • Network bandwidth usage was off the charts for the customer in this scenario. It took some digging, but we discovered it was due to a 23 GB Solr core being replicated across data centers. In the replication panel of the Solr admin UI, you could watch the slow creep of the replication progress bar; it would never reach 100% complete before starting over.

[Screenshot: the Solr admin UI replication panel, progress bar stalled short of 100%]

    • There was additional supporting evidence, such as Solr WARN messages:

[Screenshot: Solr WARN messages logged during the failed replication attempts]

    • The network latency was too great for Solr master/slave replication to complete its work, but Solr kept on trying to move that 23 GB Solr core across the planet . . . and since this was the sitecore_analytics_index, Sitecore kept it very busy. It all made for a feedback loop: frequent updates to the analytics index that could never properly synchronize between data centers.
    • For this particular scenario, we determined there wasn’t a need to replicate the sitecore_analytics_index at all (it was consumed only by the CM environment, which didn’t require the geographic scaling that Solr replication provides). We disabled master/slave replication for that specific Solr core, sketched below, and the tidal wave of network traffic stopped. Case closed!
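
For reference, here’s roughly where that change lives. Solr master/slave replication is configured per core in solrconfig.xml; this is a sketch patterned on the stock Solr ReplicationHandler configuration, and the hostname and poll interval are hypothetical placeholders, not values from this customer:

<!-- solrconfig.xml for the sitecore_analytics_index core (sketch) -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <!-- enable=false: stop publishing this core to downstream slaves -->
    <str name="enable">false</str>
    <str name="replicateAfter">commit</str>
  </lst>
  <lst name="slave">
    <!-- enable=false: stop polling the remote master for this core -->
    <str name="enable">false</str>
    <str name="masterUrl">http://dc1-solr:8983/solr/sitecore_analytics_index</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>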

Case 6 – The case of the misguided, well-intentioned administrator

  • This scenario introduces a server administrator into the equation, one who caused more harm than good. The “Optimize Now” button in the Solr admin UI lured this administrator into clicking it without understanding the consequences:

[Screenshot: the Optimize Now button in the Solr admin UI]

  • I posted about this in detail last year, so I won’t dig too far into it here, but the gist of this scenario is how Solr organizes an index into segment files, and the questionable UI choice that is the Optimize Now button: optimizing forces a merge of every segment into one, rewriting the entire core on disk. The button makes it look like an easy way to improve Solr performance when, in reality, clicking Optimize Now can be pretty expensive in terms of perf, especially for a volatile Solr core whose segments will quickly fragment again.

Case 7 – The case of the AppPool recycle-fest

  • This scenario is one from a couple of years ago, but it’s still relevant as a cautionary tale. For a long time, if Sitecore lost its active connection to Solr, the only option was to recycle the Sitecore AppPool: after a Solr server restart, or a service restart, or even a transient network failure . . . Sitecore would need to run through its application initialization logic to reacquire Solr connectivity. In this specific case, a recurring network issue kept interfering with Sitecore’s connectivity to Solr, so the customer scheduled IIS AppPool recycles every 15 minutes to ensure a fresh connection to Solr was always available. This AppPool recycle-fest had terrible consequences for website performance, as the site was constantly spending time on recycles and the related pipeline of startup events.
  • This case highlights why there are now more elegant ways of handling the problem; I recently blogged about the IsSolrAliveAgent, which is designed to solve exactly this. Sitecore now ships periodic logic to reconnect with Solr, and it’s important to appreciate why it’s there and, probably, why you may want to tune the default interval of every 10 minutes for your production environment. A sample patch follows.
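
Here’s the shape of that tuning patch. This is a minimal sketch assuming the agent is registered the way the stock Sitecore.ContentSearch.SolrProvider.config registers it; the 1-minute interval is an illustration, and there’s a trade-off, since every run costs a connectivity check against Solr:

<!-- App_Config/Include/zzzz/Tune.IsSolrAliveAgent.config (sketch) -->
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <scheduling>
      <!-- Default interval is 00:10:00; retry the Solr connection every minute instead -->
      <agent type="Sitecore.ContentSearch.SolrProvider.Agents.IsSolrAliveAgent, Sitecore.ContentSearch.SolrProvider">
        <patch:attribute name="interval">00:01:00</patch:attribute>
      </agent>
    </scheduling>
  </sitecore>
</configuration>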

That wraps up the 7 Shenanigans related to Sitecore and Solr from my talk earlier in October. It’s a fun paradigm for learning more about where Sitecore and Solr overlap, and I hope it helps others get more from their Sitecore + Solr technology stack!

The Game Is Afoot . . . Solr Shenanigans for Sitecore (Part 1)

I took the challenge of presenting at the Manchester, New Hampshire Sitecore User Group a few days ahead of the 2018 Sitecore Symposium. I say challenge because:

  1. Delivering content that isn’t superseded by the Sitecore Symposium agenda can be difficult (all the good Commerce or Azure or DevOps material would be saved for Symposium week)
  2. I would be following Michael West and his showcase of the new Sitecore PowerShell Extensions version 5.0 module (curious that https://doc.sitecorepowershell.com/releases doesn’t list the 5.0 yet, but I’m sure it’s coming there too).

As I finalized my topic, I surveyed the work I’ve been up to recently and figured I could take the talk in one of two directions that would be generally absent from the Sitecore Symposium agenda: Sitecore and Solr, or The Heresy of Sitecore on AWS. While we are doing some interesting things around AWS with RDS, ElastiCache, etc., that’s officially unsupported territory with Sitecore and not fully baked enough for me to present as anything approaching a best practice (but check back with me in 6 months). So I elected to give a talk on Solr and explore some of the lessons learned from years in the trenches making Sitecore successful with Solr; the topic was finalized as Solr, Sitecore 9, 7 Shenanigans:

[Image: the title slide from the talk]

I covered some of the history and underpinnings of Solr with regards to Sitecore, the dependence on Solr.Net (which is NOT a port of Solr to .Net the way Lucene.Net IS a port of Java’s Lucene, and why we should care), and common architecture patterns for Sitecore integrations based on Solr master/slave and Solr Cloud. I guess I’ve blogged a lot about Solr over the years; for instance, here and here are a couple of sample areas I’ve delved into.

I think the most fun part of the talk was the Shenanigans, however, as I went with a Sherlock Holmes theme to frame the conversation. I reviewed 7 cases and we had fun digging into some of the diagnostic bits.

[Slide: the Sherlock Holmes theme for the 7 Shenanigans]

Here’s a quick run down of the first 3 Shenanigans:

  1. The case of the disappearing Java
    • Where we started with a Solr that wouldn’t start for a set of Production servers . . .

[Screenshot: the error from Solr failing to start]

    • . . . and eventually solved the case by determining that the development team had installed a Java SDK with auto-update enabled, and an automatic update had removed the Java version identified in the ClassPath. This is a brutal one for a Production implementation!
  2. The case of the underachieving mega-server
    • These OutOfMemoryErrors are not fun and can be due to a variety of issues:
      • [Screenshot: an OutOfMemoryError in the Solr logs]
    • In this case, we determined this 32 GB server was running Solr with the default 512 MB of memory set aside for the Java heap . . . a pretty fundamental issue: 512 MB in use on a 32 GB server.
    • We tuned the Solr start settings to use more of the server’s capacity for Java and Solr (in this case I think we used 10 GB as a starting point; see the sketch after this list) and solved this specific case. This isn’t the last we’ll hear about OutOfMemory errors in our Shenanigans, however, as there can be many causes (see this great summary published just yesterday, for example).
  3. The case of the disappearing content
    • This scenario had a public-facing website’s content disappearing periodically after content publishes . . . sound familiar to anyone? It was due to the threshold that triggers a full rebuild of the search indexes after a Sitecore publish being set low enough to fire regularly.
      • For the record, the setting is ContentSearch.FullRebuildItemCountThreshold and the default is 100,000 (0x186a0):

[Screenshot: the ContentSearch.FullRebuildItemCountThreshold setting]

    • This customer didn’t have SwitchOnRebuild implemented for these key public-facing search indexes, so the first step of the index rebuild logic was to remove all documents from the Solr collection, then add them back as the indexing process ran its course. To the site visitor, that meant missing or inconsistent search results while the rebuild took place, and for a large set of items a rebuild can take 60 minutes or longer.
    • The solution is to use the SwitchOnRebuild implementation for their Sitecore search indexes (a config sketch follows this list); https://doc.sitecore.net/sitecore_experience_platform/setting_up_and_maintaining/search_and_indexing/indexing/switch_solr_indexes and related documentation from Sitecore cover the process.
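
First, the promised sketch for Case #2. The Solr heap is set where Solr starts, not anywhere in Sitecore configuration; on Windows that means solr.in.cmd (solr.in.sh on Linux uses the same SOLR_JAVA_MEM variable). The 10 GB figure is simply the starting point we used on that 32 GB server, not a universal recommendation:

REM solr.in.cmd (sketch): replace the 512 MB default heap with a fixed 10 GB,
REM leaving the rest of the server's RAM for the OS disk cache that Solr loves.
set SOLR_JAVA_MEM=-Xms10g -Xmx10g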
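
And for Case #3, here’s the shape of a SwitchOnRebuild index definition. This is a hedged sketch patterned on the stock Sitecore 9 Web index configuration (the constructor params have varied across Sitecore versions, so treat the Sitecore documentation linked above as authoritative), and the $(id)_rebuild core has to actually exist in Solr:

<!-- Sketch: rebuilds fill the _rebuild core while the live core keeps
     serving queries; the two are swapped when the rebuild completes. -->
<index id="sitecore_web_index"
       type="Sitecore.ContentSearch.SolrProvider.SwitchOnRebuildSolrSearchIndex, Sitecore.ContentSearch.SolrProvider">
  <param desc="name">$(id)</param>
  <param desc="core">$(id)</param>
  <param desc="rebuildcore">$(id)_rebuild</param>
  <param desc="propertyStore" ref="contentSearch/indexConfigurations/databasePropertyStore" param1="$(id)" />
  <strategies hint="list:AddStrategy">
    <strategy ref="contentSearch/indexConfigurations/indexUpdateStrategies/onPublishEndAsync" />
  </strategies>
  <locations hint="list:AddCrawler">
    <crawler type="Sitecore.ContentSearch.SitecoreItemCrawler, Sitecore.ContentSearch">
      <Database>web</Database>
      <Root>/sitecore</Root>
    </crawler>
  </locations>
</index>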

I will cover the remaining four Solr + Sitecore Shenanigans in my next post; here’s a teaser for the topics:

  • Case #4 – The case of the default query crippler
  • Case #5 – The case of the bandwidth blowout
  • Case #6 – The case of the misguided, well-intentioned administrator
  • Case #7 – The case of the AppPool recycle-fest

As Sherlock Holmes would say: “The game is afoot!”