What is database failover, and how do systems achieve high availability?

6 min · advanced · failover, high-availability, replication

Quick Answer

Failover is the process of automatically detecting that a primary database has become unavailable and promoting a replica to take over as the new primary, minimizing downtime. High availability is achieved by combining three things: replication (so a healthy, up-to-date copy exists), health monitoring via heartbeats (to detect failure quickly), and an automated promotion and routing mechanism (so clients start talking to the new primary without manual intervention). It is measured by uptime targets like "99.99%" and by recovery time and recovery point objectives (RTO/RPO).

Detailed Answer

The failover sequence

1. Primary is healthy, replicating to Replica A and Replica B.
2. Primary crashes (hardware failure, network partition, etc.).
3. A monitoring/orchestration system detects the primary is unresponsive
   (via missed heartbeats over some threshold).
4. The most up-to-date, healthy replica (say, Replica A) is PROMOTED to primary.
5. Application connections / a proxy / DNS / a virtual IP are redirected
   to point at the newly-promoted Replica A.
6. Replica B is reconfigured to replicate from the new primary (Replica A).
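
The sequence above can be sketched as a small in-memory simulation. This is a minimal illustration, not how any particular orchestrator is implemented: the Node class, the replay_position field (standing in for "how much of the primary's log this node has applied"), and the fail_over function are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    name: str
    healthy: bool = True
    replay_position: int = 0       # hypothetical: amount of the primary's log applied so far
    follows: Optional[str] = None  # node this one replicates from (None for the primary)


def fail_over(nodes: Dict[str, Node], failed_primary: str) -> str:
    """Promote the most up-to-date healthy replica and repoint the others.

    Returns the name of the new primary.
    """
    # Step 3: the monitoring system has declared the primary unresponsive.
    nodes[failed_primary].healthy = False

    candidates = [n for n in nodes.values() if n.healthy]
    if not candidates:
        raise RuntimeError("no healthy replica available to promote")

    # Step 4: promote the replica that has applied the most of the log.
    new_primary = max(candidates, key=lambda n: n.replay_position)
    new_primary.follows = None

    # Step 6: the surviving replicas now follow the new primary.
    for node in candidates:
        if node is not new_primary:
            node.follows = new_primary.name

    return new_primary.name


cluster = {
    "primary": Node("primary", replay_position=1000),
    "replica_a": Node("replica_a", replay_position=998, follows="primary"),
    "replica_b": Node("replica_b", replay_position=990, follows="primary"),
}

# Step 5: a proxy, virtual IP, or DNS record would now be pointed at this name.
write_endpoint = fail_over(cluster, "primary")
print(write_endpoint)
```

In this sketch Replica A is chosen because it has the highest replay_position. A real system would also have to confirm the failure and fence the old primary before redirecting clients.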

Key metrics that define "how good" a failover strategy is

  • RTO (Recovery Time Objective): the maximum acceptable time the system may be unavailable during a failover. The outage clients actually see runs from the moment the primary fails until the new primary is accepting traffic, so detection time counts against it. Automated failover systems can often recover in seconds to low minutes; manual intervention can take much longer.
  • RPO (Recovery Point Objective): the maximum amount of data, measured in time, that may be lost in the worst case. With synchronous replication, RPO can be effectively zero; with asynchronous replication, the actual loss is bounded by however far behind the promoted replica was at the moment of failure (see the synchronous vs. asynchronous replication question).
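
To make these numbers concrete, here is a rough arithmetic sketch. It converts an uptime target into a yearly downtime budget and compares it with the outage caused by one failover. The function names and the timings used (10 s to detect, 15 s to promote, 5 s to redirect, 2.5 s of replica lag) are illustrative assumptions, not measurements from any real system.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # non-leap year


def downtime_budget_seconds(availability_percent: float) -> float:
    """Seconds of downtime per year allowed by an uptime target."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


def failover_outage_seconds(detect: float, promote: float, redirect: float) -> float:
    """Outage clients see for one failover: detection + promotion + redirection."""
    return detect + promote + redirect


def worst_case_data_loss_seconds(replica_lag: float, synchronous: bool) -> float:
    """Writes at risk, expressed as time: zero if synchronous, else the replica's lag."""
    return 0.0 if synchronous else replica_lag


budget = downtime_budget_seconds(99.99)  # roughly 52.6 minutes per year
outage = failover_outage_seconds(detect=10, promote=15, redirect=5)
failovers_within_budget = int(budget // outage)

print(f"yearly downtime budget: {budget / 60:.1f} min")
print(f"one failover costs {outage} s; {failovers_within_budget} such failovers fit in the budget")
print(f"async worst-case loss: {worst_case_data_loss_seconds(2.5, synchronous=False)} s of writes")
```

The point of the calculation is that a "99.99%" target leaves under an hour of downtime per year, so every second spent on detection and redirection counts against it.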

Detecting failure correctly is harder than it sounds

A naive health check that fails over after a single missed heartbeat risks false positives: a transient network blip, rather than an actual primary failure, triggers a failover. That is disruptive and risky in its own right. The worst outcome is a "split-brain" scenario, where the old primary recovers a moment later and both it and the newly promoted replica believe they are the primary. This is a serious failure mode that is hard to clean up. To avoid it, real HA systems use consensus mechanisms or a quorum of independent observers, not a single health checker, to confirm a primary is truly down before triggering promotion.
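
A minimal sketch of the quorum idea follows. It assumes three independent observers, a threshold of three consecutive missed heartbeats, and a simple majority vote. The Observer class and the primary_is_down function are hypothetical names for this example; production systems typically delegate this decision to a consensus store rather than counting votes by hand.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Observer:
    name: str
    missed_heartbeats: int = 0  # consecutive misses seen by this observer

    def record(self, heartbeat_received: bool) -> None:
        # A successful heartbeat resets the counter; a miss increments it.
        self.missed_heartbeats = 0 if heartbeat_received else self.missed_heartbeats + 1

    def suspects_failure(self, threshold: int) -> bool:
        return self.missed_heartbeats >= threshold


def primary_is_down(observers: List[Observer], threshold: int = 3) -> bool:
    """Declare failure only if a strict majority of observers agree."""
    votes = sum(o.suspects_failure(threshold) for o in observers)
    return votes > len(observers) // 2


observers = [Observer("obs_1"), Observer("obs_2"), Observer("obs_3")]

# Transient blip: only obs_1 loses contact with the primary for three checks.
for _ in range(3):
    observers[0].record(False)
    observers[1].record(True)
    observers[2].record(True)
blip_triggers_failover = primary_is_down(observers)

# Genuine failure: every observer misses three heartbeats in a row.
for _ in range(3):
    for o in observers:
        o.record(False)
real_failure_triggers_failover = primary_is_down(observers)

print(blip_triggers_failover, real_failure_triggers_failover)
```

Requiring several consecutive misses filters out single dropped heartbeats, and requiring a majority filters out a network problem local to one observer. The price is slower detection, which adds to the outage time.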

Components of a full HA setup

  • Replication — at least one replica must be reasonably current to promote.
  • Health monitoring / consensus — reliably detects genuine failure without over-triggering on transient issues.
  • Automated promotion — a replica is reconfigured to accept writes as the new primary.
  • Client redirection — a proxy, load balancer, virtual IP, or DNS update routes traffic to the new primary without requiring every application instance to be manually reconfigured.
  • Re-establishing replication topology — surviving replicas need to start following the new primary, and (ideally) the old primary, if it recovers, needs to safely rejoin as a replica rather than as a conflicting second primary.
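
The last point, a recovered old primary rejoining safely, can be sketched with a shared leader record. This assumes a hypothetical record held in a consensus store, containing the current leader's name and a generation number that is bumped on every promotion. LeaderRecord, LocalState, and role_after_restart are invented names for illustration, and real tools handle many more cases (for example, rewinding diverged data before rejoining).

```python
from dataclasses import dataclass


@dataclass
class LeaderRecord:
    """Hypothetical shared record, e.g. kept in a consensus store."""
    leader: str
    generation: int  # bumped on every promotion


@dataclass
class LocalState:
    name: str
    believed_role: str    # what the node thought it was before it went away
    seen_generation: int  # last generation this node observed


def role_after_restart(node: LocalState, record: LeaderRecord) -> str:
    """Decide which role a node may take when it comes back.

    The node keeps the primary role only if the shared record still names it
    as leader for the generation it last saw; otherwise it demotes itself.
    """
    still_leader = (
        record.leader == node.name and record.generation == node.seen_generation
    )
    return "primary" if still_leader else "replica"


# The old primary crashed at generation 7; replica_a was promoted at generation 8.
old_primary = LocalState("primary", believed_role="primary", seen_generation=7)
record = LeaderRecord(leader="replica_a", generation=8)

print(role_after_restart(old_primary, record))
```

The design choice here is that a node never trusts its own memory of being primary: it checks the shared record first, which is what prevents a second, conflicting primary.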

Managed services vs. self-managed

Cloud-managed database services (AWS RDS/Aurora, Azure SQL, Google Cloud SQL) handle most of this automatically as a built-in feature, with typical failover times ranging from tens of seconds to a couple of minutes depending on the service and configuration. Self-managed HA (e.g., PostgreSQL with Patroni + etcd, or MySQL with Orchestrator) requires assembling these pieces explicitly, which is more work but gives more control over exact behavior and thresholds.

Knowing the terms RTO/RPO, and being able to explain the split-brain risk and why naive health-checking is dangerous, demonstrates real operational experience with HA — beyond just "you have a backup server that takes over."
