Strategies for Sitecore Index Organization into Solr Cores

A few days ago, I shared a graphic I put together to illustrate how Solr can be used to organize Sitecore “indexes” into Solr “cores” — this post has the complete graphic.  I want to elaborate on how one sets Sitecore up to use these two approaches, and dig further into the details.

1:1 Sitecore Index to Solr Core Strategy

To start, here’s a visual showing the typical way Sitecore “indexes” are structured in Solr using a one-to-one (1:1) mapping:

solrseparate

This shows each of the default search indexes defined by Sitecore organized into its own core defined in Solr.  It’s a 1:1 mapping.  This 1:1 strategy means each index has its own configuration (“conf”) directory in Solr, so separate stopwords.txt, solrconfig.xml, schema.xml, and so on; it also means each index has its own (“data”) directory in Solr, so separate tlog folders, separate Segment files, etc.

This is the setup one achieves by following the community documentation on setting up Sitecore with Solr; specifically, this quote from that write-up is where you’re doing a lot of the grunt work around setting up distinct Solr cores for each Sitecore index:

“Use the process detailed in Steps 4-7 to create new cores for all the remaining indexes you would like to move to SOLR.”
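For teams that would rather script the core creation than repeat it by hand, here is a minimal Python sketch that builds one Solr CoreAdmin CREATE request per Sitecore index. The host name, the choice to name each instance directory after the index, and the short list of index names (only the ones mentioned in this post) are assumptions for illustration. The sketch also assumes each instance directory already contains a valid conf folder.

```python
from urllib.parse import urlencode

# Index names mentioned in this post; a real installation defines more.
SITECORE_INDEXES = [
    "sitecore_core_index",
    "sitecore_master_index",
    "sitecore_web_index",
    "sitecore_analytics_index",
]


def core_create_url(solr_base, index_name):
    """Build a CoreAdmin CREATE URL for a core named after one Sitecore index."""
    params = urlencode({
        "action": "CREATE",
        "name": index_name,
        # Assumption: each core lives in its own directory (own conf + data).
        "instanceDir": index_name,
    })
    return f"{solr_base}/admin/cores?{params}"


def core_create_urls(solr_base, index_names=SITECORE_INDEXES):
    """Return one CREATE URL per index, which is the 1:1 strategy."""
    return [core_create_url(solr_base, name) for name in index_names]


if __name__ == "__main__":
    for url in core_create_urls("http://solr-server:8983/solr"):
        print(url)
```

Each printed URL can be requested with a browser or any HTTP client once the matching conf directory is in place.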

Since this is the common strategy, I’m not going to go into more detail, as it’s straightforward for Sitecore teams.

Kitchen Sink (∞:1 Sitecore Index to Solr Core) Strategy

Here is the comparable graphic showing the ∞:1 strategy of structuring Sitecore indexes in Solr; I like to think of this as the Kitchen Sink container for all Sitecore indexes, since everything goes into that single core just like the kitchen sink:

solrsame

With this approach, a single data and configuration definition is shared by all the Sitecore indexes that reside in Solr.  The advantage is reduced management overhead (setting up the Solr replicationHandler, for example, requires updating 15 solrconfig.xml files in the 1:1 approach, but only one solrconfig.xml file with the Kitchen Sink).  There are significant drawbacks to consider with the Kitchen Sink, however, as you’re sacrificing scaling options specific to each Sitecore index and enforcing a common schema.xml for every index stored in this single core.  There are plenty of reasons not to do this for a production installation of Sitecore, but for a crowded Sitecore environment used for acceptance testing, or other use cases where bullet-proof stability and lots of flexibility in performance tuning, sharding, etc. are not necessary, you could make a good case for the Kitchen Sink strategy.

The only change necessary to a standard Sitecore configuration to support this Kitchen Sink approach is to patch the contentSearch definitions for the Sitecore indexes where the name of the Solr “core” is specified (stored by default in config files like Sitecore.ContentSearch.Solr.Index.Master.config, Sitecore.ContentSearch.Solr.Index.Web.config, etc.).  This tells Sitecore which Solr core contains the index, but the actual name of the core doesn’t factor into the ContentSearch API code one uses with Sitecore.  A patch such as the following would point both the sitecore_master_index and the sitecore_web_index at a Solr Core named “kitchen_sink”:

<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <contentSearch>
      <configuration>
        <indexes>
          <index id="sitecore_master_index" type="Sitecore.ContentSearch.SolrProvider.SolrSearchIndex, Sitecore.ContentSearch.SolrProvider">
            <param desc="core">kitchen_sink</param>
          </index>
          <index id="sitecore_web_index" type="Sitecore.ContentSearch.SolrProvider.SolrSearchIndex, Sitecore.ContentSearch.SolrProvider">
            <param desc="core">kitchen_sink</param>
          </index>
        </indexes>
      </configuration>
    </contentSearch>
  </sitecore>
</configuration>
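With 15 or so indexes, writing each index element by hand gets tedious. As a rough sketch, a few lines of Python can emit a patch of the same shape as the one above for any list of index ids. The function name and the default core name are hypothetical, and the output should be reviewed against your Sitecore version before it is dropped into an Include folder.

```python
INDEX_TYPE = (
    "Sitecore.ContentSearch.SolrProvider.SolrSearchIndex, "
    "Sitecore.ContentSearch.SolrProvider"
)


def kitchen_sink_patch(index_ids, core_name="kitchen_sink"):
    """Return a Sitecore patch that points every given index at one Solr core.

    Assumes index ids and the core name are plain identifiers (letters,
    digits, underscores), so no XML escaping is applied.
    """
    entries = "\n".join(
        f'          <index id="{index_id}" type="{INDEX_TYPE}">\n'
        f'            <param desc="core">{core_name}</param>\n'
        f"          </index>"
        for index_id in index_ids
    )
    return (
        '<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">\n'
        "  <sitecore>\n"
        "    <contentSearch>\n"
        "      <configuration>\n"
        "        <indexes>\n"
        f"{entries}\n"
        "        </indexes>\n"
        "      </configuration>\n"
        "    </contentSearch>\n"
        "  </sitecore>\n"
        "</configuration>\n"
    )


if __name__ == "__main__":
    print(kitchen_sink_patch(["sitecore_master_index", "sitecore_web_index"]))
```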

If you peek into the Solr Admin for the kitchen_sink core that I’m using, specifically the Schema Browser in the Solr Admin UI, it becomes clear how Sitecore uses a field named “_indexname” to represent the Sitecore index value.  For the screenshot below, I’ve set the kitchen_sink core to contain two Sitecore indexes: sitecore_master_index and sitecore_web_index:

solrterms

This shows us the two terms stored in that _indexname field, and that there are 18,774 documents for sitecore_master_index and 5,851 for sitecore_web_index.  Even though the indexes are contained in the same Solr Core, Sitecore ContentSearch API code like this . . .

Sitecore.ContentSearch.ISearchIndex index =
  Sitecore.ContentSearch.ContentSearchManager.GetIndex(indexName);
using (Sitecore.ContentSearch.IProviderSearchContext ctx =
  index.CreateSearchContext())
{
  // Query through ctx exactly as you would with a 1:1 core layout.
}

. . . doesn’t care whether all the Sitecore indexes reside in a single Solr “Core” or if they’re in their own following a 1:1 mapping strategy.
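To see the same separation from the Solr side, you can query the shared core directly and filter or facet on the _indexname field. The Python sketch below only builds the request URLs. The host and core names are placeholders, and the exact filter Sitecore sends on your behalf may be formatted differently, so treat this as an illustration of the idea rather than a copy of Sitecore's own query.

```python
from urllib.parse import urlencode


def index_scoped_select_url(solr_base, core, index_name, query="*:*", rows=10):
    """Select documents that belong to one Sitecore index inside a shared core."""
    params = urlencode([
        ("q", query),
        # The filter query restricts results to a single Sitecore index.
        ("fq", f"_indexname:{index_name}"),
        ("rows", rows),
        ("wt", "json"),
    ])
    return f"{solr_base}/{core}/select?{params}"


def index_counts_url(solr_base, core):
    """Facet on _indexname to count documents per Sitecore index."""
    params = urlencode([
        ("q", "*:*"),
        ("rows", 0),
        ("facet", "true"),
        ("facet.field", "_indexname"),
        ("wt", "json"),
    ])
    return f"{solr_base}/{core}/select?{params}"


if __name__ == "__main__":
    base = "http://solr-server:8983/solr"
    print(index_scoped_select_url(base, "kitchen_sink", "sitecore_web_index"))
    print(index_counts_url(base, "kitchen_sink"))
```

The facet request should return the same per-index counts that the Schema Browser shows for the _indexname terms.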

Caveats and Going In A Different Direction

There was a bug or two in earlier versions of Sitecore related to this, so be careful with early Sitecore 7.2 or Sitecore 8 implementations (and if you’re using Sitecore 7.5, you’ve got plenty of other things to worry about so don’t sweat a Solr Core organization strategy!).

I should also note that while this post is looking at combining Sitecore indexes into a single Solr Core for convenience, and to reduce the management headaches of having 15 Solr Cores to update, there are some implementations that go in the opposite direction.  Consider a strategy like the following:

solrmindblown

 

There may be circumstances where keeping Sitecore indexes in their own Solr Core — and even isolating them further into their own Solr implementation — could be in order.  Solr runs in a JVM and this could certainly factor in, but there are other shared run-time resources that Solr sets aside for the whole Solr application.

I’m not familiar enough with these sorts of implementations to comment further or recommend any course of action right now, but it’s worth considering in Solr tuning scenarios.  I just wanted to share it, as it’s a logical dimension to consider given the two previous strategies in this post.

 

The Solr *Optimize Now* Button for Sitecore Use Cases

If you’ve worked with Sitecore and Solr, you’re no stranger to the Solr Admin UI.  There are great aspects to that UI, and some exciting extension points with Sitecore implications too, but I know one element of that Solr UI that causes some head-scratching . . . the “optimize now” feature:

OptimizeNow.JPG

 

The inclusion of the red warning icon causes people to think “something is wrong . . . red bad . . . must click now!”

What is this Optimize?

I’m writing this for the benefit of Sitecore developers who may not be coming at this from a deep search background: do not worry if your Solr cores show this warning icon encouraging you to optimize now.  Fight that instinct.  For standard Sitecore Solr cores that are frequently updating, such as the sitecore_core_index, sitecore_master_index, sitecore_analytics_index, and (depending on your publishing strategy) sitecore_web_index, one may notice these cores almost always appear with this “optimize now” button in the Solr Admin UI.  Other indexes, too, may be heavily in use depending on how your Sitecore implementation is structured.  If you choose the optimize now option and then reload the screen, you’ll see the friendly green check mark next to Optimized and you should notice the Segment Count drops to a value of 1:

segmentcount1.JPG

Segments are Lucene’s (and therefore Solr’s) file system unit.  On disk, “segments” are where data is durably stored and organized for Lucene.  In Solr version 5 and newer, one can visualize Segment details for each Solr Core via the Solr Admin UI Segments Info screen.  This shows 2 Segments:

segmentinfo

If your Segment count is greater than 1, the Solr Admin UI will report that your Solr Core is in need of Optimization (with that somewhat alarming warning icon).  The Optimize operation re-organizes all the Segments in a Core down to just one single Segment . . . and for busy Sitecore indexes this is not something to do very often (or at all!).
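If you would rather check the Segment count from a script than from the Admin UI, the Luke request handler (for example /solr/sitecore_master_index/admin/luke?numTerms=0&wt=json) reports index statistics. The Python sketch below assumes the JSON response carries an index.segmentCount value and applies the rule of thumb described above. The sample payload is made up for illustration, not captured from a real core.

```python
import json


def segment_count(luke_response):
    """Read the segment count out of a parsed Luke handler response.

    Assumption: the response has the shape {"index": {"segmentCount": N, ...}}.
    """
    return luke_response["index"]["segmentCount"]


def ui_would_suggest_optimize(luke_response):
    """Mirror the rule described in this post: more than one Segment
    means the Admin UI nudges you to optimize."""
    return segment_count(luke_response) > 1


if __name__ == "__main__":
    # Made-up sample payload for illustration only.
    sample = json.loads('{"index": {"numDocs": 1000, "segmentCount": 2}}')
    print(segment_count(sample), ui_would_suggest_optimize(sample))
```

A result of True here is just information about the Segment layout, not a reason to optimize.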

To track an optimize operation through at the file system level, consider this snapshot of the /data/index directory for a sitecore_master_index before performing optimization; note the quantity of files:

optimizebefore

After the optimization, consider the same file system:

optimizeafter

When in doubt, don’t optimize

Solr’s optimize now command is like cleaning up a house after a party.  It reduces the clutter and consolidates the representation of the Solr Core on disk to a minimal footprint.  The problem, however, is that optimizing takes longer the larger the index is, so the act of optimizing may produce very non-optimal performance while it’s doing the work.  Solr has to read a copy of the entire index and restructure the copy into a single Segment.  This can be slow.  Caches must be re-populated after an optimization, too, compounding the performance impact.  To continue the analogy, imagine cleaning up during a party; maybe you pause the music and ask everyone to leave the house for 20 minutes while you straighten everything up.  Then everyone returns and the partying resumes, with the cleaning being a mostly useless delay.

To draw from the official Solr documentation at https://wiki.apache.org/solr/SolrPerformanceFactors#Optimization_Considerations:

“Optimizing is very expensive, and if the index is constantly changing, the slight performance boost will not last long. The trade-off is not often worth it for a non static index.”

For those Sitecore indexes in Solr that are decidedly non-static, then, ignore that “optimize now” feature of the Solr Admin UI.  It’s better to pay attention to Solr “Merge Policies” for a rules based approach to maintaining Segments; this is a huge topic, one left for another time.

When to consider optimizing

Knowing more about the optimization process in Solr, then, we can think about when it may be appropriate to apply the optimize command.  For external data one is pulling into Solr, for example, a routine optimization could make sense.  If you have a weekly product data load, for instance, where 10,000 items are regularly loaded into a Solr Core and then they remain un-changed, optimization after the load completes makes a lot of sense.  That data in the Core is not dynamic.  When the data load completes, you could include an API call to Solr that triggers the optimize.

An API call to trigger an optimize in Solr is available through an update handler call: http://solr-server:8983/solr/sitecore_product_catalog/update?stream.body=<optimize><query>*:*</query></optimize>
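As an alternative that avoids stream.body, the update handler also accepts an optimize=true request parameter. Here is a minimal Python sketch of a post-load step using that form. The core name comes from the example above, while the function names, the timeout, and the maxSegments and waitSearcher values are assumptions to adjust for your own data load.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def optimize_url(solr_base, core, max_segments=1, wait_searcher=True):
    """Build an update-handler URL that asks Solr to optimize one core."""
    params = urlencode({
        "optimize": "true",
        "maxSegments": max_segments,
        "waitSearcher": str(wait_searcher).lower(),
    })
    return f"{solr_base}/{core}/update?{params}"


def optimize_after_load(solr_base, core, timeout_seconds=600):
    """Call this once the weekly data load has finished.

    Returns the HTTP status code. The generous timeout is an assumption,
    since optimizing a large index can take a while.
    """
    with urlopen(optimize_url(solr_base, core), timeout=timeout_seconds) as response:
        return response.status


if __name__ == "__main__":
    print(optimize_url("http://solr-server:8983/solr", "sitecore_product_catalog"))
```

Keeping this as an explicit final step of the load job, rather than a scheduled task, ensures the optimize only runs when the data has actually changed.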

Sitecore search has a very checkered past with the Lucene Optimize operation.  I’ve worked on big projects that were crippled by too-frequent optimizing, like that discussed in Uli Weltersbach’s post.  We ended up customizing the Optimize methods to be no-op statements, or some variation of that.  For additional validation, check out the Lucene docs on the optimize method:

“This method has been deprecated, as it is horribly inefficient and very rarely justified.”

Since Solr sits on top of Lucene, the heritage of Lucene’s optimize is still relevant and — in the Solr Admin UI — we see a potential performance bottleneck button ripe for clicking . . . fight that instinct!


 

Solr Configuration for Integration with Sitecore

I’ve got a few good Solr and Sitecore blogs around 75% finished, but I’ve been too busy lately to focus on finishing them.  In the meantime, I figure a picture can be worth 1,000 words sometimes, so let me post this visual representation of Solr strategies for Sitecore integrations.  One Solr core per index is certainly the best practice for production Sitecore implementations, but now that Solr support has significantly matured at Sitecore, a single Solr core for all the Sitecore indexes is a viable, if limited, option:

draft

There used to be a bug (or two?) that made this single Solr core for every Sitecore index unstable, but that’s been corrected for some time now.

More to follow!