lesson 08 · track 03 — processing

Streaming: Kafka,
watermarks, windows.

Kafka's mental model in five minutes. Event time vs processing time — the distinction that breaks every junior. Watermarks, windows, and late data. The exact reason your stream job's results "look wrong."

16 min 4 simulations +105 XP available
Learning objectives. After this lesson you can: (1) explain the difference between event time and processing time and why it matters for window results; (2) describe how watermarks work and the latency/completeness trade-off; (3) distinguish at-most-once, at-least-once, and exactly-once delivery semantics and explain how each is achieved in Kafka; (4) compare Flink and Spark Structured Streaming on the key dimensions an IC5 engineer cares about; (5) choose the right window type for a given aggregation requirement.

Streaming systems compute over unbounded input. There is no "end of file." The consumer must produce useful answers continuously, with bounded latency, while new data keeps arriving — and arriving out-of-order, and arriving late, and sometimes never arriving at all.

The whole field is engineered around two clocks that disagree: event time (when the thing happened in the real world, encoded in the event) and processing time (when your system saw it). They diverge constantly. A user's phone goes offline at 14:32, queues 200 events, comes back online at 14:47 and uploads them all at once. Event time of those events: 14:32. Processing time: 14:47. Your stream job has to choose which one to use, and the choice is almost always event time.

Kafka: the canonical log

Five concepts fit on the back of a napkin:

  • Topic — a named append-only log. orders, page_views.
  • Partition — a topic is split into N partitions; each partition is an independent ordered log. Ordering is per-partition, not per-topic.
  • Producer — writes records to a topic. Picks a partition by hashing a key, or round-robin if no key.
  • Consumer group — a set of consumer processes that share the work of reading a topic. Kafka assigns each partition to exactly one consumer in the group. Add consumers, work spreads. Up to the partition count, after which extra consumers idle.
  • Offset — each consumer group remembers where it last read in each partition. Read is non-destructive; rewinding to an old offset replays history.
The two interview gotchas. (1) Partition count is the parallelism cap. If you'll ever want 100 consumers, create at least 100 partitions up front — increasing later breaks key-to-partition mappings. (2) Ordering is per-key, not global. Two events for user_id=42 always land in the same partition, so they're ordered. Two events for different users may interleave.

Event time vs processing time

Imagine a "count of events per minute" job. At 14:35 (processing time) you see an event with event-time 14:32. Does it count toward the 14:32 bucket (event time) or the 14:35 bucket (processing time)? Almost always you want event time — that's what business users mean. But event time is hard, because you don't know when the event-time minute is "complete." An event for 14:32 might still arrive at 14:50.

This is what watermarks exist to solve. A watermark is the system's promise: "I think I've seen all events with event-time ≤ T." When the watermark passes 14:33, the engine emits the result for the 14:32 minute and assumes it's done. Late data that arrives after the watermark is either dropped, sent to a dead-letter, or triggers a "late update" that revises the result.

The watermark, watched

Watermark slack is a real knob. Set it tight (1 second behind max event time): low latency, lots of late data dropped. Set it loose (5 minutes behind): few drops, but every result is 5 minutes late. Production systems usually pick something tied to the p99 of observed lateness — and accept a 0.1% drop rate as the cost of timeliness.

Window types

WindowShapeUse for
TumblingFixed, non-overlapping (every minute, every hour)"events per minute"
Hopping (Sliding)Fixed, overlapping (every 30s, sized 5min)moving averages, smooth dashboards
SessionVariable, gap-based (closes after T seconds of inactivity)user sessions, IoT bursts
GlobalOne window forever; uses custom triggersrunning totals with manual flush

Delivery semantics: at-most-once, at-least-once, exactly-once

Every streaming system makes a promise about what happens when something fails. The three levels, in ascending complexity:

  • At-most-once. Each message is delivered zero or one times. If the consumer crashes after reading but before processing, the message is lost. The consumer advances its offset before processing. Simplest, but data loss is possible. Acceptable only for non-critical metrics (e.g. approximate page-view counters where loss is tolerable).
  • At-least-once. Each message is delivered one or more times. The consumer commits its offset only after successfully processing. On crash and restart, unacked messages are redelivered. No data loss, but duplicates are possible. The consumer must be idempotent (able to safely process the same message twice). This is the default for most Kafka consumers and most streaming frameworks out of the box.
  • Exactly-once. Each message is processed exactly once, with no duplicates and no loss. Hard to achieve in a distributed system. Kafka provides the building blocks: (a) idempotent producers (sequence numbers prevent broker-side duplicate writes), (b) transactional producers (atomic multi-partition writes), (c) read-committed consumers (only see committed transactions). Together these form the "exactly-once semantics" (EOS) API — but both producer and consumer must participate. If the sink (e.g. a database) doesn't support transactional writes, you need a different approach (e.g. upsert with a deduplication key).
The IC5 nuance. "Exactly-once" in streaming usually means "effectively exactly-once" — the output is as-if each message was processed once, even if it was actually processed multiple times and later deduplicated. True end-to-end exactly-once requires every component (source, stream processor, sink) to participate. In practice, at-least-once + idempotent sink is the pragmatic choice for most production systems.

Flink vs Spark Structured Streaming

Two engines dominate production streaming in 2026. Both are mature and capable; the right choice depends on your latency target, team skillset, and existing infrastructure:

Apache FlinkSpark Structured Streaming
Processing modelTrue streaming (record-by-record, pipeline-based)Micro-batch (processes small batches at a configured trigger interval)
LatencySub-second (milliseconds possible)Seconds (micro-batch interval; configurable but typically 1-30s)
State managementFirst-class: RocksDB-backed, incremental checkpoints, queryable stateManaged via Spark's structured state store; less flexible for complex stateful ops
Exactly-onceNative EOS via checkpoints + 2-phase commit to sinksEOS via WAL + idempotent sinks; depends on sink support
Event time / watermarksFirst-class, per-operator watermarks, flexible late-data handlingWatermark support in Structured Streaming; less expressive than Flink for complex patterns
Ecosystem fitStrong for dedicated streaming infrastructure; Confluent Cloud, Kinesis, CDCNatural for teams already on Spark/Databricks; unified batch+streaming codebase
Sweet spotLow-latency streaming, complex stateful logic, fraud detection, CEPSpark shops wanting near-real-time without a separate streaming cluster
The IC5 decision rule. If latency must be sub-second, or you need complex stateful processing (pattern matching, sessionization, joins across multiple streams), pick Flink. If your team lives in Spark/Databricks and the latency target is >5 seconds, Structured Streaming gets you streaming without paying the operational cost of a separate Flink cluster. Don't pick Flink just because it's "more real streaming" — it's operationally heavier and the micro-batch model is fine for most analytical use cases.

Quick check

Key takeaways

  • Always use event time for business aggregations. Processing time is non-deterministic — the same event produces different results depending on when it arrived. Event time is what the business means; use it, even though it's harder.
  • Watermarks are a latency/completeness trade-off knob. Tight watermark = low latency, more dropped late data. Loose watermark = less dropped data, higher latency on all results. Set to p99 of observed event lateness for your workload.
  • At-least-once + idempotent sink is usually the right call. True end-to-end exactly-once requires every component to participate; at-least-once with a deduplicating upsert sink achieves the same observable result at lower complexity.
  • Flink for sub-second and complex state; Spark Structured Streaming for Spark shops. Don't over-engineer: the micro-batch model handles most analytical streaming needs. Reach for Flink when latency or state complexity genuinely demands it.
  • Kafka partition count is the parallelism ceiling. You cannot have more active consumers (in a group) than partitions. Size partitions at design time; increasing them later breaks key ordering.

Vocab