Jun. 16, 2020

Dipsy Kapoor


6 min. read

Solr disaster recovery with CDCR can provide nearly instantaneous synchronization of your data and reduce RPO, a key disaster recovery metric, to minutes. See how disaster recovery (DR) is improved using CDCR and get the full story on the benefits, use cases and limitations.

How is Disaster Recovery Improved Using CDCR?

Cross Data Center Replication or CDCR is the ability to replicate data from a Source data center to one or more Target data centers.

The concept of Recovery Point Objective or RPO is a key measure for disaster recovery. RPO defines the amount of data that your business can tolerate losing in case of an emergency — and determines how frequently you need to back up your data and/or synchronize it across your infrastructure. 

When using CDCR for disaster recovery, the synchronization is nearly instantaneous and the RPO is measured in at most a few minutes. From a Solr perspective, both the Source and Target data centers can serve search queries when CDCR is operating and the risk associated with any loss of data or from an outage is minimal.

What Is the Difference Between Basic Disaster Recovery and Disaster Recovery with CDCR?

At SearchStax, our basic disaster recovery option uses a backup-and-restore mechanism. The backup operation creates a snapshot of the deployment, including the config data, the metadata in ZooKeeper and any custom JARs, and stores it as a bundle. The backup bundle is then moved to the Target cluster, a separate deployment in a different location from the Source cluster. Once the bundle is restored on the Target cluster, the collections, config files and metadata mirror the Source exactly as they were at the time of the backup.

On each periodic run, the operation copies all of the data in the Source collections and unbundles it on the Target. This process has several issues:

  • Becomes unwieldy as the number of collections and the data size grows
  • Consumes more resources because you have to back up all of the data every time, as opposed to just the delta changes since the last backup point
  • While the backup is being restored, the secondary deployment is briefly unavailable
  • The resulting RPO could be up to a few hours depending on the data size

The end result of this DR cluster approach is an outage on the secondary cluster of typically 5-10 minutes (but it could be more based on the data size) from the synchronization process.
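To make the RPO impact concrete, here is a minimal arithmetic sketch. It assumes a simple model in which the worst-case data-loss window is the interval between synchronizations plus the time needed to get the data onto the Target. The function name and all figures are hypothetical and chosen only for illustration; they are not SearchStax measurements.

```python
def worst_case_rpo_seconds(sync_interval_s: int, transfer_s: int) -> int:
    """Worst-case data-loss window under a simple model.

    An update written just after a sync starts waits one full interval
    for the next sync, plus the time that sync takes to reach the Target.
    """
    return sync_interval_s + transfer_s


# Hypothetical figures: a backup every 2 hours that takes 30 minutes
# to bundle, move and restore.
backup_rpo = worst_case_rpo_seconds(sync_interval_s=2 * 60 * 60, transfer_s=30 * 60)

# Hypothetical figures: updates forwarded every second, with up to
# 30 seconds of network and indexing delay.
continuous_rpo = worst_case_rpo_seconds(sync_interval_s=1, transfer_s=30)

print(backup_rpo // 60, "minutes vs", continuous_rpo, "seconds")
```

Under this model the backup interval dominates the RPO, and the transfer time grows with the data size because every run copies the full data set.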

On the other hand, disaster recovery with CDCR offers a hassle-free, streamlined alternative for replicating your data. One way to think of DR with CDCR is to imagine an open pipe from the Source to the Target, where any update on the Source flows automatically to the Target. DR with CDCR is designed to be robust even with network partitions and node failures. This is accomplished by tracking exactly which updates have been persisted to each node in the system, and retrying updates that have previously failed until they are successfully transmitted.
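The track-and-retry idea can be sketched in a few lines. This is a toy model, not Solr's implementation: the class `UpdateLogReplicator`, the `send` callback and the simulated failure are all hypothetical names introduced for illustration. It keeps an ordered log of updates plus a checkpoint marking the first update the Target has not yet confirmed, and only advances the checkpoint on success.

```python
from typing import Callable, Dict, List


class UpdateLogReplicator:
    """Toy model of log-based replication with a checkpoint."""

    def __init__(self, send: Callable[[str], bool]) -> None:
        self.log: List[str] = []  # ordered updates written on the Source
        self.checkpoint = 0       # index of the first update not yet confirmed
        self.send = send          # returns True when the Target acknowledges

    def write(self, update: str) -> None:
        self.log.append(update)

    def replicate(self) -> int:
        """Forward unconfirmed updates in order; stop at the first failure."""
        sent = 0
        while self.checkpoint < len(self.log):
            if not self.send(self.log[self.checkpoint]):
                break  # checkpoint stays put, so the next pass retries
            self.checkpoint += 1
            sent += 1
        return sent


# Simulated Target whose first attempt to receive "doc2" fails.
target: List[str] = []
failures: Dict[str, int] = {"doc2": 1}


def flaky_send(update: str) -> bool:
    if failures.get(update, 0) > 0:
        failures[update] -= 1
        return False
    target.append(update)
    return True


replicator = UpdateLogReplicator(flaky_send)
for update in ("doc1", "doc2", "doc3"):
    replicator.write(update)

first_pass = replicator.replicate()   # doc1 succeeds, doc2 fails
second_pass = replicator.replicate()  # doc2 is retried, then doc3
```

Because the checkpoint only moves forward after an acknowledgement, a failed update is neither lost nor skipped, and updates reach the Target in their original order.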

There are two ways that CDCR can be set up:

  • Uni-Directional – Uni-directional CDCR flows in only one direction, from the Source cluster to a Target cluster. When uni-directional updates are configured, updates and deletes are first written to the Source cluster, then forwarded to one or more Target data centers.
  • Bi-Directional – Bi-directional CDCR starts with two clusters, one cluster as the Source and the other as a Target. With bi-directional updates, indexing and querying must be done on a single cluster at a time to maintain consistency. The second cluster is only used when the first cluster is down. In simpler terms, one cluster can act as Source and the other as Target, but both roles, Source and Target, cannot be assigned to any single cluster at the same time.

Benefits of Using Disaster Recovery with CDCR

The biggest benefit of using CDCR for disaster recovery is that it provides a cleaner, near real-time replication of data between different clusters.

Another benefit of CDCR for disaster recovery is that it can reduce bandwidth usage and is designed to tolerate degraded connectivity or limited bandwidth. CDCR also supports batch updates, so communication channels can be optimized.
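As a rough illustration of why batching helps, the sketch below groups pending updates into fixed-size batches so that each batch can travel in a single request. The helper `batches`, the document names and the batch size of 128 are illustrative assumptions, not values taken from a particular CDCR configuration.

```python
from typing import Iterator, List, Sequence


def batches(updates: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size updates."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(updates), batch_size):
        yield list(updates[start:start + batch_size])


# 300 hypothetical pending updates, sent 128 at a time.
pending = [f"doc{i}" for i in range(300)]
grouped = list(batches(pending, 128))
requests_needed = len(grouped)

print(requests_needed, "requests instead of", len(pending))
```

Fewer, larger requests mean less per-request overhead on a slow or congested link, at the cost of slightly more delay before a given update leaves the Source.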

Since there is no downtime of the disaster recovery cluster at any time, CDCR provides a better, more highly available disaster recovery solution and RPO will be measured in a few minutes or less.

Use Cases for DR with CDCR

CDCR is useful when the business requires a very small RPO for disaster recovery. Beyond a lower RPO, CDCR can also be useful when Solr data needs to be synced across deployments that cover a broad Content Delivery Network and serve local traffic.

Customers could have multiple end-application services or installations across the globe to provide faster service. For example, a global provider could have an application instance running in the US, one in Europe and one in Asia Pacific. To provide shorter response times and faster service, Solr could be deployed in each of these regions, so that each deployment talks to its local application instance.

The Solr deployments in these regions could be synced using CDCR. CDCR does not limit the number of deployments it can sync with, so data can be synced across multiple deployments.

Disaster recovery with CDCR within the SearchStax service can be set up as:

  • Hot Disaster Recovery – Hot DR uses the same specification as the main deployment for the secondary deployment in a different region. It is a fully scaled replica of the primary, and can handle the same load as the primary in case of a failover.
  • Warm Disaster Recovery – Warm DR uses a single-node version of the primary deployment for the secondary deployment in a different region. It provides the same RPO and RTO as Hot DR, but in case of a failover, throughput is lower and performance will degrade.

Limitations of Disaster Recovery with CDCR

There is no perfect solution for recovering from disasters. In terms of cost and recovery time, there will always be some trade-offs. For example, CDCR requires less data bandwidth since the synchronization is incremental, but every schema change needs to be applied manually by the SearchStax support team.

While the limitations of CDCR are relatively minor, it is important to understand how they may impact your business requirements:

  • CDCR needs to be set up on a per collection basis – This means that any time a new collection is added, SearchStax needs to be contacted to add synchronization for the new collection.
  • CDCR only synchronizes data – Changes to the configs, aliases or JARs on the deployments need to be synced manually.
  • Current CDCR implementations do not support Basic Authentication – The current CDCR implementation by the Apache Solr community does not work if Basic Authentication is enabled. This holds true up to the current Solr version, Solr 8.11.
  • Backups are still important – CDCR does not eliminate the need for regular backups. Someone could still accidentally delete an index or something could corrupt your files, so a regular backup routine should still be followed.
  • SearchStax only supports CDCR starting with Solr version 7.2.1 – Although Solr 6.6 features CDCR, the functionality lacks bi-directional and bootstrapping support. For these reasons, SearchStax supports CDCR only from Solr version 7.2.1 onward.
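The per-collection nature of CDCR (the first limitation above) also shows up in how it is controlled: Solr's CDCR API is addressed to one collection at a time. The sketch below only builds request URLs and makes no network calls. It assumes the CDCR request handler is registered at the conventional `/cdcr` path; the base URL, the collection names and the helper `cdcr_url` are hypothetical.

```python
from urllib.parse import quote, urlencode

# A subset of the actions exposed by Solr's CDCR API.
CDCR_ACTIONS = {"START", "STOP", "STATUS", "QUEUES"}


def cdcr_url(base_url: str, collection: str, action: str) -> str:
    """Build the CDCR API URL for one collection (assumes a /cdcr handler)."""
    if action not in CDCR_ACTIONS:
        raise ValueError(f"unsupported CDCR action: {action}")
    query = urlencode({"action": action, "wt": "json"})
    return f"{base_url.rstrip('/')}/{quote(collection, safe='')}/cdcr?{query}"


# Hypothetical Source deployment and collections.
status_urls = [
    cdcr_url("https://source.example.com/solr", name, "STATUS")
    for name in ("products", "articles")
]

for url in status_urls:
    print(url)
```

Each collection needs its own request, which is why a newly added collection is not replicated until CDCR has been configured for it.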

Dipsy Kapoor, VP Engineering and Vignesh Kumar, Software Engineer, co-authored this blog post.

For more information on SearchStax disaster recovery options, go to Disaster Recovery Options for Solr Deployments.

