Kafka on Kubernetes: Lessons from Production

Running Kafka on Kubernetes used to be contrarian. Now it's normal — but most of the advice out there stops at "use a StatefulSet." Here's what actually mattered after years of running streaming infrastructure for time-series AI workloads, where ingest never stops and ordering is a contract.

Placement is everything

A Kafka broker losing its node is fine. Three brokers losing the same AZ is an outage. The two rules we enforce on every cluster:

# Pod template fragment: refuse to schedule two broker pods into the same zone.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: topology.kubernetes.io/zone
        labelSelector:
          matchLabels:
            app: kafka-broker

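Hard anti-affinity across zones: never let the scheduler co-locate brokers, even under pressure. The preferred variant (preferredDuringSchedulingIgnoredDuringExecution) is a trap; during a zone failure combined with a rolling deploy, "preferred" becomes "ignored." With the zone topologyKey shown above, the rule allows at most one broker per zone.
broker.rack set to each broker's zone, so replica placement is rack-aware and a zone loss never takes out all replicas of a partition (given a replication factor of at least 2).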
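
The second rule is easy to state and easy to get subtly wrong, so it is worth auditing. Below is a minimal sketch of such a check, assuming you have already exported the partition-to-replica assignment and each broker's broker.rack value into plain dictionaries. The function name and the sample data are hypothetical, not part of any Kafka tooling.

```python
def partitions_confined_to_one_zone(assignment, broker_zone):
    """Return the partitions whose replicas all sit in a single zone.

    assignment:  {(topic, partition): [broker_id, ...]}
    broker_zone: {broker_id: zone}, i.e. the broker.rack value of each broker
    """
    at_risk = []
    for topic_partition, replicas in sorted(assignment.items()):
        zones = {broker_zone[broker_id] for broker_id in replicas}
        if len(zones) < 2:
            # Losing this one zone would take out every replica.
            at_risk.append(topic_partition)
    return at_risk


# Made-up data: broker 2 carries a stale broker.rack label ("zone-a").
broker_zone = {0: "zone-a", 1: "zone-b", 2: "zone-a"}
assignment = {
    ("metrics", 0): [0, 1, 2],
    ("metrics", 1): [0, 2],
    ("metrics", 2): [1, 2],
}

print(partitions_confined_to_one_zone(assignment, broker_zone))
# prints [('metrics', 1)]
```

An empty list is the result you want. Anything else is a partition that a single zone loss can take fully offline.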

Retention: disk is a cliff, not a dial

Time-series workloads are write-heavy and bursty. The classic failure is quiet: a producer's throughput doubles, retention stays at seven days, and you discover the problem when the first broker hits 95% disk and the cluster starts shuffling leadership.

What works:

Size retention, not just time retention. retention.bytes per partition is your actual safety limit; time is a product decision.
Alert on days-until-full derived from growth rate, not on disk percentage. 80% full growing 1%/week is fine; 60% full growing 5%/day is an incident.
Move cold data out — object storage is where history belongs, brokers are for the hot window.
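
The arithmetic behind the first two points fits in a few lines. This is a rough sketch, not our production tooling: it assumes linear growth, reads "1%/week" as percentage points of total capacity, and uses hypothetical function names and a hypothetical 30% headroom default.

```python
def days_until_full(capacity_bytes, used_bytes, growth_bytes_per_day):
    """Linear projection of days until the disk fills; inf if it is not growing."""
    if growth_bytes_per_day <= 0:
        return float("inf")
    return (capacity_bytes - used_bytes) / growth_bytes_per_day


def retention_bytes_per_partition(disk_bytes, partition_replicas_on_broker,
                                  headroom_pct=30):
    """Per-partition retention.bytes budget that leaves headroom_pct of the disk free.

    Assumes partition replicas on the broker are roughly equal in size.
    The headroom has to absorb index files and the active segment, which
    is not deleted until it rolls.
    """
    usable = disk_bytes * (100 - headroom_pct) // 100
    return usable // partition_replicas_on_broker


# The two cases from the list above, with capacity normalised to 100 units.
print(days_until_full(100, 80, 1 / 7))  # 80% full, +1 point per week
print(days_until_full(100, 60, 5))      # 60% full, +5 points per day

# A 1 TB disk holding 200 partition replicas.
print(retention_bytes_per_partition(1_000_000_000_000, 200))
```

The first case works out to roughly 140 days of runway and the second to 8 days, which is why the fuller disk is the less urgent one.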

Consumer lag is an SLO, not a metric

Nobody cares that lag is "1.2 million messages." They care whether the downstream model is scoring on data from 40 seconds ago or 40 minutes ago. We converted lag to time-lag (latest message timestamp minus last committed message timestamp) and put SLOs on that:

max by (group, topic) (
  kafka_consumergroup_lag_seconds
) > 120

That single reframing killed an entire class of pointless pages and caught real degradations earlier.
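
For readers who want the definition spelled out, here is a small sketch of the time-lag calculation itself. It assumes you can already fetch, per partition, the timestamp of the newest message and the timestamp of the last message the group committed (both in epoch milliseconds). How those are collected is left out, and the function names are hypothetical.

```python
def time_lag_seconds(latest_ts_ms, committed_ts_ms):
    """Time lag for one partition, clamped at zero for caught-up consumers."""
    return max(0.0, (latest_ts_ms - committed_ts_ms) / 1000.0)


def group_time_lag_seconds(partitions):
    """Worst time lag across a group's partitions.

    partitions: iterable of (latest_ts_ms, committed_ts_ms) pairs
    """
    return max((time_lag_seconds(latest, committed)
                for latest, committed in partitions), default=0.0)


SLO_SECONDS = 120  # same threshold as the alert expression above

partitions = [
    (1_700_000_100_000, 1_700_000_060_000),  # 40 seconds behind
    (1_700_000_100_000, 1_699_997_700_000),  # 40 minutes behind
]

lag = group_time_lag_seconds(partitions)
print(lag, lag > SLO_SECONDS)
# prints 2400.0 True
```

Taking the maximum rather than the average is deliberate: one stuck partition is enough to put stale data in front of the downstream model.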

The unglamorous list

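Partition counts are forever-ish: adding partitions changes which partition a key hashes to, so per-key ordering breaks across the change. Plan for keyed ordering before you have 200 consumers.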
Test broker restarts under load, not on a quiet cluster — leader elections behave very differently at peak ingest.
Version upgrades: one broker at a time, watch under-replicated partitions return to zero, then continue. Boring is the goal.
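
To make the first item concrete, here is a toy illustration of how many keys move when a topic grows from 12 to 16 partitions. It uses CRC32 as a stand-in hash; Kafka's default partitioner uses murmur2, so the exact numbers will differ, but the hash-modulo-partition-count effect is the same. The key names and partition counts are made up.

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    # Stand-in hash for illustration only; Kafka's default partitioner uses murmur2.
    return zlib.crc32(key) % num_partitions


def moved_keys(keys, old_count, new_count):
    """Keys whose partition changes when the partition count changes."""
    return [k for k in keys
            if partition_for(k, old_count) != partition_for(k, new_count)]


keys = [f"sensor-{i}".encode() for i in range(1000)]
moved = moved_keys(keys, 12, 16)
print(f"{len(moved)} of {len(keys)} keys change partition going from 12 to 16")
```

Every moved key now has older messages in one partition and newer ones in another, so a consumer can no longer rely on partition order to see that key's history in sequence.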

Kafka rewards operational discipline more than clever tuning. Get placement, retention math and lag SLOs right, and it's a remarkably boring system — the highest compliment infrastructure can earn.