heartbeats in distributed systems
In distributed systems, one of the most critical challenges is detecting failures. How do you know if a node is down, or just slow? That's where heartbeats come in. The simplest and most elegant solution to failure detection.
What Are Heartbeats?
A heartbeat is a periodic signal sent from one node to another to indicate that it's still alive and functioning. Think of it like a human pulse -> if the pulse stops, something is wrong.
The pattern is simple:
- Node A sends a heartbeat message to Node B every
Tseconds - Node B expects to receive a heartbeat within a timeout period
- If Node B doesn't receive a heartbeat, it marks Node A as failed
Failure Detection
The real power of heartbeats is in detecting when something goes wrong. If heartbeats stop arriving, the system can take action.
When a heartbeat is missed, the timeout period begins. If no heartbeat arrives before the timeout expires, the node is marked as failed.
Why Use Heartbeats?
Heartbeats solve the failure detection problem in distributed systems. Without them, you'd have no way to know if a node crashed, got disconnected, or is just processing slowly.
Key benefits:
- Simple to implement - Just periodic messages
- Low overhead - Minimal network traffic
- Fast detection - Failures detected within timeout period
- Widely used - From databases to microservices
Real-World Examples
Kubernetes uses heartbeats (called kubelet heartbeats) to monitor node health. If a node stops sending heartbeats, Kubernetes reschedules pods to healthy nodes.
Cassandra uses a gossip protocol with heartbeats to detect failed nodes and maintain cluster membership.
Raft consensus relies on heartbeats from the leader to maintain authority. If followers don't receive heartbeats, they trigger a new election.
Key Takeaways
- Heartbeats are periodic signals that prove a node is alive
- Missing heartbeats trigger failure detection
- Choose timeout values carefully—too short causes false positives, too long delays detection
- Used everywhere: Kubernetes, databases, load balancers, consensus protocols
Last updated: November 2025 • Reading time: 5 minutes