Sitecore and TTL Index Heresy for MongoDB

Background

Part of our role at Rackspace is to be pro-active in tuning and optimizing Sitecore implementations for our customers.  Sometimes, the “optimizations” come more in terms of cost-savings instead of performance benefits, and sometimes there is a correlation.

Here’s an optimization that is more oriented to cost-savings.  We have customers using MongoDB for both private and shared session state in Sitecore, as well as xDB data collections; they’re using a hosted MongoDB service through ObjectRocket which provides provisioning elasticity, good performance, and access to really strong MongoDB pros.  We noticed old HTTP session data accumulating in the session collections for one customer, and couldn’t find an obvious explanation.  These session records shouldn’t be leaking through the MongoDB clean-up logic that’s part of the Sitecore API (Sitecore.SessionProvider.MongoDB.MongoSessionStateProvider), but we wanted to correct it and understand why.

The catch was, the customer’s development had several higher priority issues they were working on, and sifting through a root cause analysis of something like this could take some time.  In the interim, we addressed how to clean out the old session data from MongoDB — if the customer was using SQL Server for sessions state, a SQL Server agent deleting expired sessions could be easily employed . . . but that isn’t an option for MongoDB.

Research

At Rackspace, we began evaluating options for a MongoDB equivalent to the “clean up old session records” process and so I started by reviewing everything I could find about data retention for MongoDB and Sitecore.  It turns out there isn’t much.  There isn’t mention of data clean-up in the Sitecore documentation on shared and private session for MongoDB.  There isn’t an explicit <agent> configured anywhere, either.  With Reflector, I started looking through the Sitecore.SessionProvider.MongoDB.MongoSessionStateProvider class and I didn’t see any logic related to data clean-up.

I did finally find some success, however, in reviewing the Sitecore.SessionProvider.MongoDB.MongoSessionStateProvider class that extends the abstract Sitecore.SessionProvider.SitecoreSessionStateStoreProvider class.

The abstract class uses a timer (controlled by the pollingInterval set for the provider); it runs OnProcessExpiredItems . . . and then GetExpiredItemExclusive . . . and eventually RemoveItem for that record.  That finally calls through to the MongoDB API with the delete query:

RemoveItemThis was all helpful information to the broader team working to determine why the records were remaining in MongoDB, but we needed a quick non-invasive solution.

TTL Indexes

MongoDB supports “Time to Live (TTL) Indexes” which purge data based on time rules.  Data older than 1 week, for example, could automatically be removed with this index type.  That’s surely acceptable for sessions state records that leak through the Sitecore API.  We coordinated with the ObjectRocket team and are setting the customer up with TTL Indexes instead of the default; this should dramatically reduce the data storage footprint.

Heresy

While pursuing this effort on behalf of session state management, I realized this could be an intriguing solution to data retention challenges with Sitecore’s xDB.  Using MongoDB TTL indexes for the xDB collections would prevent that data from growing out of control.  Set a TTL value of 180 days, for example, and make use of just the most recent 6 months of user activity as part of content personalization, profiling, etc.  Of course, one sacrifices the value of the old data if one expires it after a set time.  Remember, I’m acknowledging this is heresy! 🙂

I really wonder, though, how many Sitecore implementations intend to store all the xDB data indefinitely and have a strategy for making use of all that old, ever-accumulating, data?  I think the promise of xDB analytics-based presentation rules is far greater than the reality.  I see many organizations lacking a cohesive plan for what data to gather into xDB, and how to effectively make use of it.

I think TTL Indexes for MongoDB would be a good safety net for those companies still sorting out their path with xDB and analytics, without having to bank massive volumes of data in MongoDB during the maturation process.

One final note: since conserving disk space is a priority, performing a weekly data compaction for MongoDB is a good idea.  This fully reclaims the space for all the expired documents.