Best practices tolerate failure

The goal of a failure-tolerant system is to prevent disruptions from a single point of failure, ensuring the high availability and business continuity of your mission-critical applications and infrastructure. This best practice highlights principles to consider when you design, implement, and operate your business systems so that you can achieve your reliability goals.

When you create a fault-tolerant system, you can avoid expensive downtime, ensure efficiency, and scale intelligently.

Principles and practices

The following principles provide guidance on how to design, implement, and operate your systems reliably.

Plan for resiliency and availability

As you plan for resiliency and availability, you must determine how robust your system architecture needs to be in terms of failure, degradation, and performance.

Below are some considerations:

Identify all applications and infrastructure where availability is critical.
Calculate the cost of your failure domain strategy.
Determine your uptime goals.
Compare your architecture and failure recovery plans to the requirements in your business continuity plan (BCP).
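When you determine your uptime goals, it can help to translate each percentage into a concrete downtime budget. The following is a minimal sketch of that arithmetic; the function name and the sample targets are illustrative assumptions, not values recommended by this guide.

```python
# Sketch: convert an uptime goal into a yearly downtime budget.
# The function name and sample targets are illustrative assumptions.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(uptime_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an uptime goal."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        budget = allowed_downtime_minutes(target)
        print(f"{target}% uptime allows about {budget:.1f} minutes of downtime per year")
```

Comparing this budget with the expected recovery time of each failure scenario is one way to check whether an architecture can realistically meet its goal.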

Configure fault resiliency for your Consul Enterprise datacenter using redundancy zones. Redundancy zones make it possible to run one voter and any number of non-voters in each defined zone.

To protect your Vault deployment against the catastrophic failure of an entire cluster, Vault Enterprise supports multi-datacenter deployments in which you replicate data across datacenters for performance as well as disaster recovery.

Distributed systems

When you deploy a distributed system, like Consul or Vault, one of the first considerations should be quorum. If a quorum of nodes is unavailable for any reason, the cluster becomes unavailable and no new logs can be committed.

Quorum is a majority of members from a peer set. For a set of size N, quorum requires at least (N/2)+1 members, with N/2 rounded down to a whole number. For example, if there are 5 members in the peer set, you would need 3 nodes to form a quorum.
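The quorum rule can be sketched in a few lines of code. This is only an illustration of the arithmetic described above, not part of Consul or Vault; the function names are assumptions chosen for the example.

```python
# Sketch of the quorum arithmetic; function names are illustrative assumptions.


def quorum_size(peers: int) -> int:
    """Smallest majority of a peer set: floor(N / 2) + 1."""
    if peers < 1:
        raise ValueError("a peer set needs at least one member")
    return peers // 2 + 1


def failure_tolerance(peers: int) -> int:
    """Number of members that can fail while a quorum remains."""
    return peers - quorum_size(peers)


if __name__ == "__main__":
    print("servers  quorum  tolerated failures")
    for n in (1, 2, 3, 4, 5, 7):
        print(f"{n:>7}  {quorum_size(n):>6}  {failure_tolerance(n):>18}")
```

Note that an even-sized peer set tolerates no more failures than the next smaller odd size: four servers tolerate one failure, the same as three.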

For most deployments, we recommend three to five servers for both Vault and Consul. This recommendation also applies on any cloud platform, including HashiCorp Cloud Platform (HCP). For large deployments that need to scale reads without increasing write latency by adding too many voting servers to the quorum, we recommend the non-voting or read replication features available in the Enterprise or HCP editions.

In instances of unexpected failure, both Consul and Vault can recover from quorum loss.

State management and disaster recovery

Disaster recovery considerations are an important part of a company's overall business continuity planning. Both Recovery Point Objectives (RPO) and Recovery Time Objectives (RTO) should be considered.

At a minimum, a recovery plan should include the following:

State and change management.
Immutable infrastructure for quick replication and deployment.
Automated backups that are stored on mounted or external storage, instead of local or ephemeral storage.

The frequency of backups should align with the RTO and RPO set out in your organization's disaster recovery policies. Both Consul Enterprise and Vault Enterprise have automated backup features for the cluster. Additionally, you can automate backups of your Terraform Enterprise deployment to ensure business continuity.
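As a rough illustration of how backup frequency relates to RPO, the sketch below checks whether a backup interval fits within a given RPO. It is a simplification under stated assumptions: it treats the worst-case data loss as the time between two successful backups and ignores how long a backup takes to complete. The function name and sample values are hypothetical, and the code does not call any Consul, Vault, or Terraform Enterprise API.

```python
# Sketch: check a backup schedule against an RPO.
# Assumption: worst-case data loss equals the interval between backups.
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Return True if backups run at least as often as the RPO requires."""
    if backup_interval <= timedelta(0):
        raise ValueError("backup_interval must be positive")
    return backup_interval <= rpo


if __name__ == "__main__":
    rpo = timedelta(hours=1)  # hypothetical policy value
    for interval in (timedelta(minutes=30), timedelta(hours=4)):
        status = "meets" if meets_rpo(interval, rpo) else "violates"
        print(f"Backing up every {interval} {status} an RPO of {rpo}")
```

A check like this only covers the RPO side of the policy. Verifying RTO requires timing an actual restore from backup.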

Zero-downtime deployments

Recover Terraform Enterprise