Heartbeats in Distributed Systems

Nov 12, 2025 • 5 min read

In distributed systems, one of the most critical challenges is detecting failures. How do you know if a node is down, or just slow? That's where heartbeats come in: they are among the simplest and most elegant solutions to failure detection.

What Are Heartbeats?

A heartbeat is a periodic signal sent from one node to another to indicate that it's still alive and functioning. Think of it like a human pulse: if the pulse stops, something is wrong.
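As a rough illustration of the sending side, here is a minimal Python sketch. The name `start_heartbeats` and the `send` callback are made up for this example; in a real system `send` would write to a socket or call an RPC.

```python
import threading


def start_heartbeats(send, interval_s):
    """Call send() every interval_s seconds on a background thread.

    Returns a threading.Event; set it to stop the loop.
    Hypothetical helper for illustration only.
    """
    stop = threading.Event()

    def loop():
        # Event.wait() returns False when the interval elapses and
        # True once stop has been set, which ends the loop.
        while not stop.wait(interval_s):
            send()

    threading.Thread(target=loop, daemon=True).start()
    return stop
```

Note that in this sketch the first heartbeat goes out after one full interval, not immediately. Calling `stop.set()` ends the loop.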

Basic heartbeat mechanism between two nodes

The pattern is simple:

Node A sends a heartbeat message to Node B every T seconds
Node B expects to receive a heartbeat within a timeout period
If Node B doesn't receive a heartbeat within that timeout, it marks Node A as failed
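The receiving side of the steps above can be sketched in a few lines of Python. This is a minimal sketch, not a production detector: `HeartbeatMonitor` and its method names are invented for this example, and the clock is injectable so the logic can be exercised without waiting in real time.

```python
import time


class HeartbeatMonitor:
    """Node B's view of Node A: remembers the last heartbeat and applies a timeout."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        # Assumption: treat startup as if a heartbeat had just arrived.
        self.last_seen = clock()

    def on_heartbeat(self):
        """Record that a heartbeat from Node A just arrived."""
        self.last_seen = self.clock()

    def is_failed(self):
        """True if no heartbeat has arrived within the timeout."""
        return self.clock() - self.last_seen > self.timeout_s
```

Using a monotonic clock by default avoids false alarms when the wall clock is adjusted.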

Failure Detection

The real power of heartbeats is in detecting when something goes wrong. If heartbeats stop arriving, the system can take action.

Failure detected after missing heartbeat at t=3s

The timeout is measured from the last heartbeat that arrived. If no new heartbeat arrives before it expires, the node is marked as failed.
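To make the timing concrete, here is a small sketch that works out when a monitor would first declare failure. It assumes the timeout is counted from the most recent heartbeat; `first_failure_time` and its parameters are hypothetical names for this example.

```python
def first_failure_time(arrivals, timeout_s):
    """Return the time at which a monitor would first declare failure.

    arrivals: times at which heartbeats were received.
    timeout_s: allowed silence after the most recent heartbeat.
    """
    if not arrivals:
        raise ValueError("need at least one heartbeat arrival")
    ordered = sorted(arrivals)
    for prev, nxt in zip(ordered, ordered[1:]):
        # A gap longer than the timeout triggers detection, even if
        # a (late) heartbeat shows up afterwards.
        if nxt - prev > timeout_s:
            return prev + timeout_s
    # No oversized gap: failure is declared once the stream goes quiet.
    return ordered[-1] + timeout_s


first_failure_time([0, 1, 2], timeout_s=2)     # -> 4
first_failure_time([0, 1, 2, 5], timeout_s=2)  # -> 4 (late heartbeat at t=5)
```

The second call shows a false positive: the node was only slow, but it had already been marked as failed by the time its heartbeat arrived.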

Why Use Heartbeats?

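Heartbeats solve the failure detection problem in distributed systems. Without them, you'd have no way to notice that a node has crashed, been disconnected, or stalled. Keep in mind that a missed heartbeat only tells you the node hasn't been heard from in time; on its own it can't tell a crashed node from a slow or partitioned one.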

Key benefits:

Simple to implement - Just periodic messages
Low overhead - Minimal network traffic
Fast detection - Failures detected within timeout period
Widely used - From databases to microservices

Leader monitoring multiple worker nodes via heartbeats
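The same idea extends to one leader watching many workers. A minimal sketch, assuming each heartbeat carries a worker ID and the caller supplies the current time; `WorkerRegistry` is a made-up name for this example.

```python
class WorkerRegistry:
    """Leader-side table of when each worker was last heard from."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # worker_id -> time of last heartbeat

    def on_heartbeat(self, worker_id, now):
        """Record a heartbeat from worker_id received at time now."""
        self.last_seen[worker_id] = now

    def failed_workers(self, now):
        """Return the sorted IDs of workers silent for longer than the timeout."""
        return sorted(
            worker_id
            for worker_id, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

A worker that has never sent a heartbeat is unknown to this registry, so a real system would also need a registration step.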

Real-World Examples

Kubernetes uses heartbeats sent by the kubelet on each node to monitor node health. If a node stops sending heartbeats, the control plane marks it as unhealthy and, after a grace period, evicts its pods so they can be recreated on healthy nodes.

Cassandra uses a gossip protocol with heartbeats to detect failed nodes and maintain cluster membership.
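The general idea behind gossip-style heartbeats can be sketched as follows. This is a simplified illustration, not Cassandra's actual implementation or data model: each node is assumed to keep a map from node ID to the highest heartbeat counter it has heard of, and `merge_views` is a hypothetical helper.

```python
def merge_views(local, remote):
    """Merge two {node_id: heartbeat_counter} views, keeping the higher counter."""
    merged = dict(local)
    for node_id, counter in remote.items():
        # A higher counter means the remote peer has fresher news about node_id.
        if counter > merged.get(node_id, -1):
            merged[node_id] = counter
    return merged


merge_views({"a": 5, "b": 2}, {"b": 7, "c": 1})  # -> {"a": 5, "b": 7, "c": 1}
```

A node whose counter stops increasing across gossip rounds becomes a candidate for being marked as failed.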

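Raft consensus relies on periodic heartbeats from the leader (AppendEntries messages that carry no log entries) to maintain its authority. If a follower receives none within its election timeout, it becomes a candidate and starts a new election.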
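Here is a heavily simplified sketch of the follower side of that rule. It is not a Raft implementation: term comparison, voting, and log replication are left out, and the `Follower` class with its 150-300 ms randomized timeout range is an assumption made for illustration.

```python
import random


class Follower:
    """Tracks the election timer of a single node (illustrative only)."""

    def __init__(self, min_timeout_ms=150, max_timeout_ms=300, rng=None):
        self.rng = rng or random.Random()
        self.min_timeout_ms = min_timeout_ms
        self.max_timeout_ms = max_timeout_ms
        self.role = "follower"
        self.term = 0
        self._reset_timer(now_ms=0)

    def _reset_timer(self, now_ms):
        # Randomizing the timeout makes it unlikely that several
        # followers time out and start elections at the same moment.
        timeout = self.rng.uniform(self.min_timeout_ms, self.max_timeout_ms)
        self.deadline_ms = now_ms + timeout

    def on_leader_heartbeat(self, now_ms):
        """A heartbeat from the leader resets the election timer."""
        self.role = "follower"
        self._reset_timer(now_ms)

    def tick(self, now_ms):
        """Advance time; start an election if the timer has expired."""
        if self.role == "follower" and now_ms >= self.deadline_ms:
            self.role = "candidate"
            self.term += 1
        return self.role
```

As long as heartbeats keep arriving before the deadline, `tick` keeps returning `"follower"`.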

Key Takeaways

Heartbeats are periodic signals that prove a node is alive
Missing heartbeats trigger failure detection
Choose timeout values carefully: too short causes false positives, too long delays detection
Used everywhere: Kubernetes, databases, load balancers, consensus protocols
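One way to reason about the timeout trade-off is to derive the timeout from the heartbeat gaps you actually observe. The sketch below uses a common heuristic (mean gap plus a few standard deviations). It is an assumption for illustration, not a rule from any specific system, and `suggest_timeout` and `k` are hypothetical names.

```python
from statistics import mean, pstdev


def suggest_timeout(gaps_s, k=4.0):
    """Return mean + k * standard deviation of observed heartbeat gaps.

    gaps_s: seconds between consecutive heartbeats seen so far.
    k: safety margin; larger values mean fewer false positives
       but slower detection.
    """
    if not gaps_s:
        raise ValueError("need at least one observed gap")
    return mean(gaps_s) + k * pstdev(gaps_s)
```

With perfectly regular gaps the suggestion equals the heartbeat interval, so in practice you would also want a fixed minimum margin on top.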
