
heartbeats in distributed systems

Nov 12, 2025 • 5 min read

In distributed systems, one of the most critical challenges is detecting failures. How do you know whether a node is down or just slow? That's where heartbeats come in: one of the simplest and most widely used approaches to failure detection.

What Are Heartbeats?

A heartbeat is a periodic signal sent from one node to another to indicate that the sender is still alive and functioning. Think of it like a human pulse: if the pulse stops, something is wrong.

Figure: Basic heartbeat mechanism between two nodes (Node A sends periodic heartbeat messages to Node B; interval: 1s)

The pattern is simple:

  • Node A sends a heartbeat message to Node B every T seconds
  • Node B expects to receive a heartbeat within a timeout period
  • If Node B doesn't receive a heartbeat within that timeout, it marks Node A as failed
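
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the names `HeartbeatMonitor` and `run_sender`, the injectable `clock`, and the choice to treat startup as the first heartbeat are all assumptions made for the example.

```python
import time


class HeartbeatMonitor:
    """Node B's view of Node A: alive until no heartbeat arrives within `timeout`."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout      # seconds Node B is willing to wait
        self.clock = clock          # injectable so the logic can be tested
        self.last_seen = clock()    # assumption: treat startup as the first heartbeat

    def on_heartbeat(self):
        """Call whenever a heartbeat from Node A is received."""
        self.last_seen = self.clock()

    def is_failed(self):
        """True once more than `timeout` seconds have passed since the last heartbeat."""
        return self.clock() - self.last_seen > self.timeout


def run_sender(send, interval, beats, sleep=time.sleep):
    """Node A's side: call `send()` every `interval` seconds, `beats` times."""
    for _ in range(beats):
        send()
        sleep(interval)
```

In a real deployment `send` would write to a socket or call an RPC, and Node B would call `on_heartbeat()` from its receive loop. A monotonic clock is used so that wall-clock adjustments do not look like missed heartbeats.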

Failure Detection

The real power of heartbeats is in detecting when something goes wrong. If heartbeats stop arriving, the system can take action.

Figure: Failure detected after missing heartbeat at t=3s (timeline from t=0 to t=4s; the receiver is waiting for heartbeats)

The receiver restarts its timeout timer every time a heartbeat arrives. When a heartbeat is missed, the timer simply keeps running, and if no heartbeat arrives before the timeout expires, the node is marked as failed.
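
To make the timing concrete, here is a small sketch that computes when a receiver would declare failure, given the times at which heartbeats actually arrived. The function name `detection_time` and the 1.5s timeout in the usage line are assumptions chosen for illustration.

```python
def detection_time(arrivals, timeout):
    """Return the time at which the receiver declares failure.

    `arrivals` are the times (in seconds) at which heartbeats were received.
    The timer restarts on every arrival; failure is declared `timeout`
    seconds after the most recent heartbeat if nothing else arrives first.
    """
    arrivals = sorted(arrivals)
    if not arrivals:
        raise ValueError("need at least one heartbeat")
    for prev, nxt in zip(arrivals, arrivals[1:]):
        if nxt - prev > timeout:
            return prev + timeout   # the gap was too long: failure declared mid-gap
    return arrivals[-1] + timeout   # nothing arrived after the last heartbeat


# Heartbeats at t=0, 1, 2, then the sender goes silent.
print(detection_time([0, 1, 2], timeout=1.5))  # 3.5
```

With a 1s interval and a 1.5s timeout, the heartbeat expected at t=3s never arrives and the failure is declared at t=3.5s. More generally, if network delay is ignored, a crash is detected somewhere between `timeout - interval` and `timeout` seconds after it happens, depending on how recently the last heartbeat was sent.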

Why Use Heartbeats?

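Heartbeats address the failure detection problem in distributed systems. Without them, a silent node is ambiguous: it may have crashed, been disconnected, or just be processing slowly. A heartbeat with a timeout cannot tell these cases apart either, but it gives every node a clear, bounded rule for when to stop waiting and act.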

Key benefits:

  • Simple to implement - Just periodic messages
  • Low overhead - Minimal network traffic
  • Fast detection - Failures detected within timeout period
  • Widely used - From databases to microservices

Figure: Leader monitoring multiple worker nodes via heartbeats (Nodes 1 to 4 all send heartbeats to the leader)
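
The leader's side of this picture can be sketched as a table of last-seen timestamps, one per worker. The class name `LeaderRegistry` and its methods are hypothetical, and the sketch assumes a worker only becomes known to the leader once its first heartbeat arrives.

```python
import time


class LeaderRegistry:
    """Tracks the most recent heartbeat from each worker node."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.last_seen = {}   # node id -> time of most recent heartbeat

    def on_heartbeat(self, node_id):
        """Record a heartbeat from `node_id` (registers the node on first contact)."""
        self.last_seen[node_id] = self.clock()

    def failed_nodes(self):
        """Return the ids of nodes whose last heartbeat is older than `timeout`."""
        now = self.clock()
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout)
```

The leader would call `failed_nodes()` periodically and react to the result, for example by reassigning that node's work.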

Real-World Examples

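Kubernetes uses heartbeats to monitor node health. The kubelet on each node sends them by updating its Node status and renewing a Lease object. If a node stops sending heartbeats, the control plane marks it as unhealthy and its pods are eventually evicted and rescheduled onto healthy nodes.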

Figure: heartbeat kubernetes autodiscovery

Cassandra uses a gossip protocol with heartbeats to detect failed nodes and maintain cluster membership.

Figure: gossip protocol in Cassandra
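
The general idea of gossip-based heartbeats can be sketched as follows. This is a deliberately simplified model and not Cassandra's actual implementation, which tracks more state per node and uses an accrual-style failure detector rather than a fixed timeout. The class `GossipNode` and all of its methods are hypothetical.

```python
class GossipNode:
    """Each node keeps a heartbeat counter for every node it has heard about."""

    def __init__(self, name):
        self.name = name
        self.counters = {name: 0}     # node name -> highest heartbeat counter seen
        self.updated_at = {name: 0}   # node name -> local time the counter last advanced

    def beat(self, now):
        """Advance our own heartbeat counter."""
        self.counters[self.name] += 1
        self.updated_at[self.name] = now

    def merge(self, peer_counters, now):
        """Merge a peer's view: keep the higher counter for every node."""
        for node, counter in peer_counters.items():
            if counter > self.counters.get(node, -1):
                self.counters[node] = counter
                self.updated_at[node] = now

    def suspected(self, now, timeout):
        """Nodes whose counter has not advanced for more than `timeout`."""
        return sorted(n for n, t in self.updated_at.items()
                      if n != self.name and now - t > timeout)
```

The point of the design is that a node does not need to hear from every peer directly. As long as some peer relays a newer counter for node X, X is considered alive.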

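Raft consensus relies on heartbeats from the leader to maintain its authority. The leader periodically sends heartbeats (AppendEntries messages that carry no log entries) to every follower. If a follower receives none within its election timeout, it becomes a candidate and starts a new election.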
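
A follower's timer logic can be sketched like this. It covers only the timeout behaviour described above, not voting, terms, or log replication. The class name `Follower` is hypothetical, and the 150 to 300 ms range is an illustrative default rather than a required value.

```python
import random


class Follower:
    """Follower-side election timer: reset on heartbeat, become candidate on expiry."""

    def __init__(self, now, min_timeout=0.150, max_timeout=0.300, rng=random):
        self.min_timeout = min_timeout
        self.max_timeout = max_timeout
        self.rng = rng
        self.role = "follower"
        self._reset(now)

    def _reset(self, now):
        # A fresh randomized timeout makes it unlikely that followers time out together.
        self.deadline = now + self.rng.uniform(self.min_timeout, self.max_timeout)

    def on_heartbeat(self, now):
        """Leader heartbeat received: stay a follower and restart the timer."""
        self.role = "follower"
        self._reset(now)

    def tick(self, now):
        """Check the timer; become a candidate if the leader has been silent too long."""
        if self.role == "follower" and now >= self.deadline:
            self.role = "candidate"
        return self.role
```

Randomizing the timeout per follower is what keeps several followers from starting competing elections at the same instant.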

Key Takeaways

  • Heartbeats are periodic signals that prove a node is alive
  • Missing heartbeats trigger failure detection
  • Choose timeout values carefully: too short causes false positives, too long delays detection
  • Used everywhere: Kubernetes, databases, load balancers, consensus protocols

Last updated: November 2025 • Reading time: 5 minutes