Sitecore and Solr Diagnostics: Pings and Logs

As I mentioned last week in this post about Sitecore + Solr – Lucene, I want to share some more posts about running Sitecore with Solr. While 2017 is shaping up to be the year of Azure for Sitecore, I also expect Solr to continue to grow in prominence across Sitecore implementations.

Solr exposes lots of information about its current condition and state, but it’s not all that easy to access. I’ll take a run at shedding light on some of the monitoring basics and see where this goes:

Ping

A proof-of-life check for Solr is usually part of an implementation, whether it’s for a load-balancer in front of Solr nodes or some other form of health check. This is typically done via Solr’s /admin/ping endpoint, so a request to http://localhost:8983/solr/sitecore_master_index/admin/ping?wt=json would result in something like:

{
  "responseHeader": {
    "status": 0,
    "QTime": 0,
    "params": {
      "q": "{!lucene}*:*",
      "distrib": "false",
      "df": "text",
      "rows": "10",
      "wt": "json",
      "echoParams": "all"
    }
  },
  "status": "OK"
}

I chose the JSON response writer for Solr (more about this at https://wiki.apache.org/solr/CoreQueryParameters).  Solr’s default response writer is XML and returns a lot more formatting text around the exact same information.

There is actually a lot of flexibility around this ping RequestHandler for Solr (check out the documentation).  I’m scratching the surface here since, for a plain vanilla Sitecore install, knowing that ping will either be “OK” or return the failing HTTP error code is usually sufficient.
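For reference, the behavior of that endpoint comes from a PingRequestHandler definition in each core’s solrconfig.xml. Here’s a sketch loosely based on the stock definition; treat the invariants and the healthcheckFile name as assumptions to verify against your own install (the q value below simply mirrors what the response above echoed back):

<!-- sketch of the ping handler in solrconfig.xml; verify against your Solr version -->
<requestHandler name="/admin/ping" class="solr.PingRequestHandler">
  <lst name="invariants">
    <str name="q">{!lucene}*:*</str>
  </lst>
  <lst name="defaults">
    <str name="echoParams">all</str>
  </lst>
  <!-- optional: a marker file that lets you enable/disable the ping response
       via .../admin/ping?action=disable and ?action=enable without stopping Solr -->
  <!-- <str name="healthcheckFile">server-enabled.txt</str> -->
</requestHandler>

The healthcheckFile option is worth knowing about if you’re fronting Solr with a load-balancer, since it gives you a way to gracefully pull a node out of rotation without stopping Solr.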

Logging

So the ping is easy, but it only reports OK or NOT OK.  It doesn’t provide any real granularity around that determination . . . so let’s assume you’ve got a Solr server that is not returning an “OK” status from the ping.  If you’ve restarted Solr and it still isn’t responding “OK”, then it’s probably time to dig into the Solr logs.  Solr defaults to using a variation of Log4j to capture trace logs, so the documentation on Log4j can be helpful (just note that Log4j is currently on version 2.x, while the Solr versions I’ve worked with [up to Solr 6.2] all use the earlier Log4j 1.2 [log4j-1.2.17.jar]).

Solr uses the file server\resources\log4j.properties as the basis for Log4j configuration, but this can vary depending on how you’ve installed Solr.  If you need to determine where this file is configured in your implementation, check the Dashboard of the Solr web admin tool, which shows the runtime configuration settings for Solr.  I’ve highlighted the two log-related settings in my example:

[Screenshot: the Solr admin Dashboard, with the two log-related JVM arguments highlighted]

The -Xloggc parameter is for Java garbage collection (GC) logging and not usually something you would consider Solr-specific, but it’s good to know it exists if that becomes a topic to investigate.
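For context, here’s roughly how those two settings show up among the JVM arguments on the Dashboard. Both paths below are placeholders from a hypothetical Windows install; yours will reflect where and how Solr was started:

-Xloggc:C:\solr\server\logs\solr_gc.log
-Dlog4j.configuration=file:C:\solr\server\resources\log4j.properties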

Inside a standard log4j.properties file one finds the following:

#  Logging level
solr.log=logs
log4j.rootLogger=INFO, file

log4j.appender.CONSOLE=org.apache.log4j.ConsoleAppender

log4j.appender.CONSOLE.layout=org.apache.log4j.EnhancedPatternLayout
log4j.appender.CONSOLE.layout.ConversionPattern=%-4r %-5p (%t) [%X{collection} %X{shard} %X{replica} %X{core}] %c{1.} %m%n

#- size rotation with log cleanup.
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.MaxFileSize=50MB
log4j.appender.file.MaxBackupIndex=9

#- File to log to and log format
log4j.appender.file.File=${solr.log}/solr.log
log4j.appender.file.layout=org.apache.log4j.EnhancedPatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss.SSS} %-5p (%t) [%X{collection} %X{shard} %X{replica} %X{core}] %c{1.} %m\n

log4j.logger.org.apache.zookeeper=WARN
log4j.logger.org.apache.hadoop=WARN

# set to INFO to enable infostream log messages
log4j.logger.org.apache.solr.update.LoggingInfoStream=OFF

That configuration file is fairly self-explanatory.  Adjust the rootLogger value from INFO if you’d prefer DEBUG, WARN, or some other level.  I have just one other note on the configuration above:

  • The MaxFileSize setting in older Solr installations defaulted to 4 MB; that’s pretty small for a production instance, so the 50 MB value in the example above (a Solr 6 default) makes more sense for production.
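As a concrete sketch, bringing an older install up to that larger size is just a couple of lines in log4j.properties (the values here are examples, not one-size-fits-all recommendations):

# bump the rolling file size from the old 4MB default and keep 9 backups
log4j.appender.file.MaxFileSize=50MB
log4j.appender.file.MaxBackupIndex=9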

If you peek into a Solr log file, there are a couple of common record types by default.  Here’s what a logged Solr query looks like:

webapp=/solr path=/select params={df=text&distrib=false&fl=_uniqueid&fl=score&shards.purpose=4&start=0&fsv=true&fq=_indexname:(sitecore_testing_index)&shard.url=http://10.225.240.243:8983/solr/sitecore_testing_index_shard2_replica2/|http://10.225.240.72:8983/solr/sitecore_testing_index_shard2_replica1/&rows=2147483647&version=2&q=__is_running_b:(True)&NOW=1481640759343&isShard=true&wt=javabin} hits=0 status=0 QTime=0

After the query parameters, the log reports the number of hits for the query, the status (0 means “OK”) and the duration for this request in milliseconds.  Note that Solr queries are logged after they complete . . . so if you have a run-away query you may not see it in the log with the standard configuration.  More about this at a later time, perhaps.

Updates to Solr are logged using a similar pattern; those entries start with webapp=/solr path=/update.

It’s possible to set up Log4j to create more narrowly targeted logs.  For example, you could have queries and updates written to different log files for easier troubleshooting (see this example for the nitty-gritty on targeting Solr logs); a rough sketch follows.
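The logger name in this sketch is based on the Solr class that emits update messages (org.apache.solr.update.processor.LogUpdateProcessorFactory), so verify it against your Solr version before relying on this:

# hypothetical sketch: route update logging to its own rolling file
log4j.appender.updates=org.apache.log4j.RollingFileAppender
log4j.appender.updates.File=${solr.log}/solr_updates.log
log4j.appender.updates.MaxFileSize=50MB
log4j.appender.updates.MaxBackupIndex=9
log4j.appender.updates.layout=org.apache.log4j.EnhancedPatternLayout
log4j.appender.updates.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss.SSS} %-5p (%t) [%X{core}] %c{1.} %m%n

# send update messages to that appender only, not the main solr.log
log4j.logger.org.apache.solr.update.processor.LogUpdateProcessorFactory=INFO, updates
log4j.additivity.org.apache.solr.update.processor.LogUpdateProcessorFactory=false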

If your project doesn’t want all this “noise” about each update and query to Solr, or there’s a concern that all this logging could become a bottleneck, you can set the log level for Solr to WARN and only see warnings and errors in your logs.  You will be sacrificing that query performance data, however, so you won’t have a record to review for slow-performing queries.  A recommended practice, then, is to trigger a “WARN” condition when a query exceeds a certain threshold: add a <slowQueryThresholdMillis> element to the <query> section of solrconfig.xml:

<query>
  <slowQueryThresholdMillis>2000</slowQueryThresholdMillis>
  <!-- ...the rest of the existing <query> settings... -->
</query>

Any queries that take longer than 2 seconds (the 2000 milliseconds defined above) will be logged as “slow” queries at the WARN level.  Note that this setting would need to be duplicated for every core defined in Solr (meaning, with Sitecore’s one-core-per-index approach, you’ll have several solrconfig.xml files to touch).
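Just to make that concrete, here’s a made-up illustration of what a slow-query entry looks like in solr.log (the exact layout varies a bit by version, but it’s essentially the same request line shown earlier, logged at WARN and prefixed with “slow:”, which makes it easy to grep for):

slow: webapp=/solr path=/select params={q=__is_running_b:(True)&fq=_indexname:(sitecore_testing_index)&wt=javabin} hits=0 status=0 QTime=2543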

Besides these fairly routine entries, Solr logs will also contain exceptions, and the Logging section of the Solr web admin provides a convenient way to see the most recent set of them.  It’s a handy way of viewing Log4j items for your Solr implementation in near real-time; note that if you change the Level setting from this web interface, the change does not persist across a Solr restart.

Example of a Solr logging section in the web admin interface:

[Screenshot: the Logging section of the Solr web admin interface]

There are a number of other ways to review the Log4j data: old-school Chainsaw is still around, and some folks I know prefer the Otros log viewer, but I’ve had success with Splunk (see the screenshot below):

[Screenshot: Solr log data reviewed in Splunk]

I’ll get into other methods for diagnostic work with Solr in the next post.

Sitecore Gets Serious About Search, by Leaving the Game

Sitecore is steadily reducing the use-cases for Lucene as the search provider for the Sitecore product.  Sitecore published this document on “Solr or Lucene” a few days ago, going so far as to state this one criterion where you must use Solr:

  • You have two or more content delivery servers

Read that bullet point again.  If you’ve worked with Sitecore for a while, this should trigger an alarm.

Sound the alarm: most implementations have 2 or more CD servers!  I’d say 80% or more of Sitecore implementations are using more than a single CD server, in fact.  Carrying this logic forward, 80% of Sitecore implementations should not be running on Lucene!  This is a big departure for Sitecore as a company, which has historically dodged conversations about the right technology for a given situation.  I think the Sitecore philosophy was that advocating for one technical approach over another is risky: an endorsement could make them accountable if that approach fails for one reason or another.  For as long as I’ve known Sitecore, it’s been a software company intent on selling licenses, not dictating how to use the product, and risk aversion has kept them from getting really down in the weeds with customers.  This tide has been turning, however, with things like Helix and now this more aggressive messaging about the limitations of Lucene.  I think it’s great for Sitecore to be more vocal about best practices; it’s just taken years for them to come around to the idea.

As a bit of a search geek, I want to state for the record that this new Solr-over-Lucene guidance from Sitecore is not really an indictment of Lucene.  The Apache Lucene project, and its cousin the .NET port Lucene.net that Sitecore uses out of the box, was groundbreaking in many ways.  As a technology, Lucene can handle enormous volumes of documents and performs really well.  Solr is making use of Lucene underneath it all, anyway!  This recent announcement from Sitecore is more an acknowledgement that Sitecore’s event plumbing is no substitute for Solr’s CAP-theorem straddling acrobatics.  Sitecore is done trying to roll their own distributed search system.  I think this is Sitecore announcing that they’re tired of patching the EventQueue, troubleshooting search index update strategies for enterprise installations, and giving up on ensuring clients don’t hijack the Sitecore heartbeat thread and block search indexing with some custom boondoggle in the publishing pipeline.  They’re saying: we give up — just use Solr.

Amen.  I’m fine with that.

To be honest, I think this change of heart can also be explained by the predominant role Azure Search plays in the newest Sitecore PaaS offering.  Having a single IP address for all the “search stuff” is awfully nice whether you’re running in Azure or anywhere else.  It’s clear Sitecore decided a few years ago that they weren’t keen to reinvent the search wheel, and they’re steadily converging on technologies with either huge corporate backing (Azure Search) or a vibrant open-source community (Solr).

I should also note here that Coveo for Sitecore probably welcomes this opportunity for Sitecore teams to break away from the shackles of the local Lucene index.  I’m not convinced that Coveo can out-run the likes of Azure Search or Solr in the long term, but I know that today, if your focus is quickly ramping up a search-heavy site on Sitecore, you should certainly weigh the pros and cons of Coveo.

With all this said, I want to take my next few posts and dig really deep into Solr for Sitecore and talk about performance monitoring, tuning, and lots of areas that get overlooked when it comes to search with Sitecore and Solr.   So I’ll have more to say on this topic in the next few days!