A Solr administrator can make a single-character configuration error that issues no warning but causes delayed failures as the project matures. These issues are not easily diagnosed from their symptoms. One of these sensitive areas has to do with misconfigured Solr “replicas.” This post will examine the most common reasons behind replica issues and how to fix them.
A Solr “replica” is a complete copy of one index. It is a very simple thing. Every Solr collection has at least one replica. (See What is a collection/core/shard/replica? for more information on Solr index concepts.)
Most SearchStax clients develop and test their projects on single-server deployments for reasons of economy. The server has one copy of the index (one replica). Production systems, however, usually have 2-3 servers, and can scale up to many servers. Each server has its own replica of the index.
In normal operation, replicas support the following behaviors:
- Zookeeper monitors the replicas, keeping index files in sync across the cluster.
- A load balancer distributes incoming queries. The servers can all respond to queries because they all have a copy of the index. Parallelism reduces query latency.
- When a server goes offline (due to a failure or a “rolling restart”), the remaining servers handle the query load with no interruption.
Unfortunately, it is easy to create a Solr collection where there are fewer replicas than servers. For instance, people sometimes use the wrong replicationFactor setting when creating a collection. We have often seen three-node systems that had only one replica.
This situation creates the following issues:
- During a period of high query load, query latency may suffer because only one node is doing all the work.
- During a “rolling restart,” the system may “fail over” to a server that doesn’t have a copy of the index. This causes a service interruption.
- Manual and scheduled backups depend on every node having a replica. If some nodes lack replicas, backups may fail.
- If some nodes lack a replica, Pulse may have difficulty monitoring that collection.
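One way to spot this condition is to compare each collection's replica placement against the list of live nodes. The sketch below assumes the JSON shape returned by the Solr Collections API CLUSTERSTATUS action (a "cluster" object containing "live_nodes" and "collections"). The function name and the sample data are hypothetical, and the code inspects an already-parsed response rather than contacting a server.

```python
def nodes_missing_replicas(cluster_status):
    """Map each collection name to the sorted list of live nodes hosting no replica of it."""
    cluster = cluster_status["cluster"]
    live_nodes = set(cluster["live_nodes"])
    missing = {}
    for name, collection in cluster["collections"].items():
        # Collect every node that hosts at least one replica of this collection.
        hosting = set()
        for shard in collection["shards"].values():
            for replica in shard["replicas"].values():
                hosting.add(replica["node_name"])
        absent = sorted(live_nodes - hosting)
        if absent:
            missing[name] = absent
    return missing


# Hypothetical three-node cluster whose "products" collection has only one replica.
sample_status = {
    "cluster": {
        "live_nodes": ["node1:8983_solr", "node2:8983_solr", "node3:8983_solr"],
        "collections": {
            "products": {
                "shards": {
                    "shard1": {
                        "replicas": {
                            "core_node1": {"node_name": "node1:8983_solr", "state": "active"}
                        }
                    }
                }
            }
        },
    }
}

print(nodes_missing_replicas(sample_status))
```

The check assumes single-shard collections, as described in this post. With multiple shards, a node hosting any one shard would count as covered, so a per-shard comparison would be needed instead.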
Replicas can also go into “recovery mode.” Solr administrators sometimes overload their systems by asking Solr to index too many records in a single batch. CPU levels max out at 100% for extended periods. This causes service outages as one replica after another goes into recovery mode for no visible reason.
Zookeeper checks the status of each replica every few minutes. When this process times out due to CPU overload, Zookeeper assumes that the replica’s server is down. It puts the replica into “recovery mode” while it plays back all recent changes to repair the replica. The replica is unavailable to the system until this process is complete.
Replica recovery places additional burdens on the node’s CPU, of course, which interferes with Zookeeper’s attempts to monitor other replicas on the same node. In a multi-collection system (such as a Sitecore index), one replica after another goes into recovery. Cascading failures can bring down the whole cluster.
The immediate cure is to stop Solr and then restart it. This interrupts ingestion and gives Zookeeper a chance to catch up. The replicas quickly come back online. To avoid this behavior, reduce the ingestion batch size so that Zookeeper has adequate access to the CPU.
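As a rough illustration of smaller batches, the sketch below splits a stream of records into fixed-size chunks and optionally pauses between them. Everything here is an assumption for demonstration: the function names, the batch size of 500, and the send_batch callback, which stands in for whatever client code actually posts documents to Solr. A suitable batch size depends on document size and server capacity.

```python
import time
from itertools import islice


def batched(records, batch_size):
    """Yield successive lists of at most batch_size records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def ingest_in_batches(records, send_batch, batch_size=500, pause_seconds=0.0):
    """Send records in small batches, optionally pausing between them.

    send_batch is a caller-supplied function (hypothetical here) that posts
    one batch to Solr. Returns the number of batches sent.
    """
    count = 0
    for batch in batched(records, batch_size):
        send_batch(batch)
        count += 1
        if pause_seconds:
            time.sleep(pause_seconds)  # give the CPU time for other work
    return count


# Demonstration with a stand-in sender that only records batch sizes.
sent_sizes = []
total = ingest_in_batches(
    ({"id": str(i)} for i in range(1200)),
    send_batch=lambda batch: sent_sizes.append(len(batch)),
    batch_size=500,
)
print(total, sent_sizes)
```

Watching CPU levels while tuning batch_size and pause_seconds is a reasonable way to find values that keep the nodes below saturation.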
Best practice: For each Solr collection, a replica should be present on every node of the cluster.
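To make the best practice concrete, the sketch below builds two Solr Collections API requests: a CREATE call whose replicationFactor equals the number of nodes, and an ADDREPLICA call that places a replica on a node that lacks one. The action and parameter names are those of the Collections API. The helper names, base URL, collection name, and node name are hypothetical, and a real CREATE request may need further parameters such as a configset name.

```python
from urllib.parse import urlencode


def create_collection_url(base_url, name, node_count, num_shards=1):
    """Build a CREATE request that asks for one replica per node."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": node_count,  # one replica for every server
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


def add_replica_url(base_url, collection, shard, node):
    """Build an ADDREPLICA request that targets a specific node."""
    params = {
        "action": "ADDREPLICA",
        "collection": collection,
        "shard": shard,
        "node": node,
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


# Hypothetical three-node deployment.
base = "http://localhost:8983/solr"
create_url = create_collection_url(base, "products", node_count=3)
repair_url = add_replica_url(base, "products", "shard1", "node2:8983_solr")
print(create_url)
print(repair_url)
```

The functions only construct URLs, so the requests can be reviewed before they are sent to a live cluster.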