A typical DR service works by replicating application state between two data centers. If the primary data center becomes unavailable, the backup site takes over and activates a new copy of the application using the most recently replicated data.

Some of the key requirements for an effective DR service are based on business decisions, such as the monetary cost of system downtime or data loss, while others are directly tied to application performance and correctness.

Disaster Recovery is primarily a form of long-distance state replication combined with the ability to start up applications at the backup site after a failure is detected. The amount and type of state that is sent to the backup site can vary depending on the application's needs. State replication can be done at one of these layers:
(i) within an application,
(ii) per disk or within a file system, or
(iii) for the full system context.

Failover and Failback
In addition to managing state replication, a DR solution must be able to detect when a disaster has occurred, perform a failover procedure to activate the backup site, and run the failback steps necessary to return control to the primary data center once the disaster has been dealt with. Detecting a disaster is challenging because transient failures or network segmentation can trigger false alarms. In practice, most DR techniques rely on manual detection and failover mechanisms.
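
To illustrate why transient failures make detection hard, the following is a minimal sketch of a heartbeat-based detector that raises an alarm only after several consecutive missed probes. The class name FailureDetector, the observe method, and the threshold of three misses are all assumptions made for this example and do not describe any particular DR product. In line with the manual practice noted above, the result is best read as a proposal for an operator to confirm rather than an automatic failover trigger.

```python
from dataclasses import dataclass


@dataclass
class FailureDetector:
    """Hypothetical detector: suspects a disaster only after several
    consecutive missed heartbeats from the primary site."""

    miss_threshold: int = 3  # assumed value; a real deployment would tune this
    missed: int = 0          # consecutive misses seen so far

    def observe(self, heartbeat_received: bool) -> bool:
        """Record one probe result; return True if failover should be proposed."""
        if heartbeat_received:
            self.missed = 0  # a single success clears the suspicion
        else:
            self.missed += 1
        return self.missed >= self.miss_threshold


detector = FailureDetector(miss_threshold=3)

# A transient glitch: two misses followed by a successful heartbeat.
transient = [detector.observe(ok) for ok in (False, False, True)]

# A sustained outage: three misses in a row.
sustained = [detector.observe(ok) for ok in (False, False, False)]

print(transient)
print(sustained)
```

A higher threshold reduces false alarms at the cost of slower detection. Note that a simple counter like this cannot by itself distinguish a failed primary from a network segmentation that merely hides a healthy one.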

In most cases, a disaster will eventually pass and a business will want to return control of its applications to the original site. To do this, the DR software must support bidirectional state replication so that any new data created at the backup site during the disaster can be transferred back to the primary. However, this can be a major challenge: the primary site may have lost an arbitrary amount of data due to the disaster, so the replication software must be able to determine which new and old state must be resynchronized to the original site. In addition, the failback procedure must be scheduled and implemented so as to minimize application downtime.
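
The resynchronization problem can be sketched as a comparison of per-block digests between the two sites. The example below assumes, purely for illustration, that state is divided into numbered blocks and that each site is modeled as an in-memory dictionary; the function blocks_to_resync and the sample data are hypothetical. It is one possible approach, not the method of any specific replication system.

```python
import hashlib
from typing import Dict, List


def digest(data: bytes) -> str:
    """Content fingerprint used to compare blocks without trusting timestamps."""
    return hashlib.sha256(data).hexdigest()


def blocks_to_resync(backup: Dict[int, bytes], primary: Dict[int, bytes]) -> List[int]:
    """Return the IDs of blocks that must be copied from backup to primary.

    A block qualifies if the primary lost it, never received it, or holds
    different content from the backup.
    """
    stale = []
    for block_id, data in backup.items():
        current = primary.get(block_id)
        if current is None or digest(current) != digest(data):
            stale.append(block_id)
    return sorted(stale)


# Hypothetical state after the disaster has passed.
backup = {
    0: b"base",
    1: b"orders-v2",            # updated at the backup during the disaster
    2: b"new-during-disaster",  # created at the backup during the disaster
    3: b"old-archive",          # old data that the primary lost
}
primary = {
    0: b"base",
    1: b"orders-v1",
}

plan = blocks_to_resync(backup, primary)
print(plan)
```

The plan covers both new state (blocks 1 and 2) and old state the primary lost (block 3), while the unchanged block 0 is skipped. Over a long-distance link, the sites would presumably exchange only the digests first and then transfer the blocks that differ, which keeps the failback window shorter.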

