Kafka on Kubernetes: Lessons from Production
Rack-aware placement, tiered retention, consumer-lag SLOs and the failure modes nobody warns you about when you run Kafka for time-series AI workloads.
- #kafka
- #kubernetes
- #distributed-systems
- #streaming
Running Kafka on Kubernetes used to be contrarian. Now it's normal — but most of the advice out there stops at "use a StatefulSet." Here's what actually mattered after years of running streaming infrastructure for time-series AI workloads, where ingest never stops and ordering is a contract.
Placement is everything
A Kafka broker losing its node is fine. Three brokers losing the same AZ is an outage. The two rules we enforce on every cluster:

    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: topology.kubernetes.io/zone
            labelSelector:
              matchLabels:
                app: kafka-broker

- Hard anti-affinity across zones: never let the scheduler co-locate brokers, even under pressure. preferred is a trap; under a zone failure plus a rolling deploy, "preferred" becomes "ignored."
- broker.rack mapped to the zone, so replica placement is rack-aware and a zone loss never takes out all replicas of a partition.
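One way to sanity-check the second rule is to audit the replica assignment against the zone each broker reports as its rack. This is a minimal sketch, not a tool we ship: the broker-to-zone map, the partition names, and the function names are all hypothetical, and in practice the inputs would come from the cluster's own metadata.

```python
# Hypothetical inputs: broker id -> zone (the value set as broker.rack),
# and partition name -> list of replica broker ids.
BROKER_ZONE = {0: "zone-a", 1: "zone-a", 2: "zone-b", 3: "zone-c"}

ASSIGNMENT = {
    "events-0": [0, 2, 3],  # replicas spread over three zones
    "events-1": [0, 1],     # both replicas sit in zone-a
}


def zones_for(replicas, broker_zone):
    """Return the set of zones that hold at least one replica."""
    return {broker_zone[broker_id] for broker_id in replicas}


def partitions_at_risk(assignment, broker_zone):
    """Return partitions whose replicas all live in a single zone.

    Losing that zone would take out every replica of these partitions.
    """
    return sorted(
        partition
        for partition, replicas in assignment.items()
        if len(zones_for(replicas, broker_zone)) < 2
    )


at_risk = partitions_at_risk(ASSIGNMENT, BROKER_ZONE)
```

With the sample data above, only events-1 is flagged, because both of its replicas sit on brokers in the same zone.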
Retention: disk is a cliff, not a dial
Time-series workloads are write-heavy and bursty. The classic failure is quiet: a producer's throughput doubles, retention stays at seven days, and you discover the problem when the first broker hits 95% disk and the cluster starts shuffling leadership.
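The arithmetic behind that failure is simple enough to sketch. The numbers below are illustrative assumptions (10 MB/s of ingest, replication factor 3), and the function ignores compression, index files and the active segment, so treat it as a rough estimate rather than a sizing tool.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def retained_bytes(ingest_bytes_per_sec, retention_days, replication_factor):
    """Approximate steady-state bytes on disk under time-only retention."""
    return (
        ingest_bytes_per_sec
        * retention_days
        * SECONDS_PER_DAY
        * replication_factor
    )


# Assumed workload: 10 MB/s, seven-day retention, replication factor 3.
before = retained_bytes(10_000_000, 7, 3)
# The same topic after the producer's throughput doubles.
after = retained_bytes(20_000_000, 7, 3)

print(f"before: {before / 1e12:.1f} TB, after: {after / 1e12:.1f} TB")
```

With time-only retention, the disk footprint scales linearly with throughput, so a doubled producer quietly doubles the space the cluster needs.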
What works:
- Size retention, not just time retention. retention.bytes per partition is your actual safety limit; time is a product decision.
- Alert on days-until-full derived from growth rate, not on disk percentage. 80% full growing 1%/week is fine; 60% full growing 5%/day is an incident.
- Move cold data out — object storage is where history belongs, brokers are for the hot window.
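The days-until-full alert can be sketched as a small calculation. This assumes linear growth and reads "1%/week" as one percentage point of total disk capacity per week; the 14-day alert threshold is a hypothetical value, not one from our runbooks.

```python
ALERT_BELOW_DAYS = 14  # hypothetical threshold for illustration


def days_until_full(used_fraction, growth_fraction_per_day):
    """Days until the disk reaches 100%, assuming linear growth."""
    if growth_fraction_per_day <= 0:
        return float("inf")  # flat or shrinking usage never fills the disk
    return (1.0 - used_fraction) / growth_fraction_per_day


def should_alert(used_fraction, growth_fraction_per_day):
    """Alert on remaining runway rather than on the disk percentage itself."""
    runway = days_until_full(used_fraction, growth_fraction_per_day)
    return runway < ALERT_BELOW_DAYS


slow = days_until_full(0.80, 0.01 / 7)  # 80% full, growing 1%/week
fast = days_until_full(0.60, 0.05)      # 60% full, growing 5%/day

print(f"slow: {slow:.0f} days, fast: {fast:.0f} days")
```

The fuller disk has roughly 140 days of runway while the emptier one has about 8, which is why the percentage alone is the wrong signal.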
Consumer lag is an SLO, not a metric
Nobody cares that lag is "1.2 million messages." They care whether the downstream model is scoring on data from 40 seconds ago or 40 minutes ago. We converted lag to time-lag (latest message timestamp minus last committed message timestamp) and put SLOs on that:

    max by (group, topic) (
      kafka_consumergroup_lag_seconds
    ) > 120

That single reframing killed an entire class of pointless pages and caught real degradations earlier.
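For readers who want the time-lag definition spelled out, here is a minimal sketch of the calculation. The PartitionProgress type and its field names are hypothetical; how the two timestamps are collected depends on your exporter, and this is not a description of any particular one.

```python
from dataclasses import dataclass

SLO_SECONDS = 120  # same threshold as the alert expression above


@dataclass
class PartitionProgress:
    latest_ts_ms: int     # timestamp of the newest message in the partition
    committed_ts_ms: int  # timestamp of the last message the group committed


def time_lag_seconds(progress):
    """Latest message timestamp minus last committed message timestamp."""
    lag_ms = progress.latest_ts_ms - progress.committed_ts_ms
    return max(0.0, lag_ms / 1000.0)  # clamp clock skew to zero


def group_time_lag(partitions):
    """The worst partition defines the group's lag, like max by (group, topic)."""
    return max((time_lag_seconds(p) for p in partitions), default=0.0)


def breaches_slo(partitions, slo_seconds=SLO_SECONDS):
    """True when the group's time-lag exceeds the SLO."""
    return group_time_lag(partitions) > slo_seconds


sample = [
    PartitionProgress(latest_ts_ms=1_000_000, committed_ts_ms=960_000),
    PartitionProgress(latest_ts_ms=1_000_000, committed_ts_ms=700_000),
]
```

In the sample, one partition is 40 seconds behind and the other 300 seconds, so the group's time-lag is 300 seconds and it breaches the 120-second SLO, regardless of how many messages that represents.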
The unglamorous list
- Partition counts are forever-ish. Plan for keyed ordering before you have 200 consumers.
- Test broker restarts under load, not on a quiet cluster — leader elections behave very differently at peak ingest.
- Version upgrades: one broker at a time, watch under-replicated partitions return to zero, then continue. Boring is the goal.
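The upgrade rule in the last item can be sketched as a gate between brokers. The callables upgrade_broker and get_urp_count are hypothetical hooks you would wire to your own deployment tooling and metrics; the timeout and poll interval are placeholder values.

```python
import time


def wait_for_zero_urp(get_urp_count, timeout_s=600.0, poll_s=5.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll until under-replicated partitions reach zero or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        if get_urp_count() == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)


def rolling_upgrade(broker_ids, upgrade_broker, get_urp_count, **wait_kwargs):
    """Upgrade one broker at a time, continuing only once URP is back to zero.

    Raises RuntimeError at the first broker after which the cluster does not
    recover, leaving the remaining brokers untouched.
    """
    done = []
    for broker_id in broker_ids:
        upgrade_broker(broker_id)
        if not wait_for_zero_urp(get_urp_count, **wait_kwargs):
            raise RuntimeError(
                f"under-replicated partitions did not clear after broker {broker_id}"
            )
        done.append(broker_id)
    return done
```

Stopping at the first broker that fails to recover is deliberate: it leaves the rest of the cluster on the known-good version while someone investigates.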
Kafka rewards operational discipline more than clever tuning. Get placement, retention math and lag SLOs right, and it's a remarkably boring system — the highest compliment infrastructure can earn.