lesson 01 · track 01 — foundations

The stack, top to bottom.

Every data system you'll ever design is the same six layers in different clothes. Source, log, processing, storage, serving, consumption. Get this picture in your head and the rest of the course is just zooming in.

Learning objectives
  • Name and describe the six canonical data stack layers
  • Explain why each layer exists and what it is optimised for
  • Recognise the latency/throughput and schema-on-write/read trade-offs
  • Know which layer answers analytics queries, and why the production OLTP database must not
12 min · 4 simulations · +85 XP available

Open any data-engineering job description, any system-design interview, any "modern data stack" diagram from a vendor selling you something. Squint. They are all the same shape. Six layers, top to bottom.

You will hear different names for them. Bronze, silver, gold. OLTP, OLAP, DWH, mart. Source, ingest, transform, store, serve. The vocabularies differ; the geometry doesn't. Bytes are born somewhere. They flow into a log. Something processes them. They land in storage. A query engine answers questions about them. A human or a model finally consumes the answer.

The trick to thinking like an IC5 is to hold this whole picture in your head at once — and to know, for any byte in the system, what layer it is in and why. Lose track of that and you'll spend your career debugging "the data is wrong" tickets that have nothing to do with your code.

Watch one event flow

Press play. A single click on a checkout button — let's say a $48.90 order from Brazil — has to make it from a phone all the way to the dashboard your VP looks at on Monday. It crosses every one of these layers. Burst shows what production looks like.
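As a rough companion to the walkthrough, here is a minimal Python sketch of that one order crossing all six layers. Everything in it is an illustrative assumption: the field names (`order_id`, `amount_usd`, `country`), the in-memory list standing in for the log, and the dict of lists standing in for columnar storage. It is a toy, not how any particular system implements these layers.

```python
import json
from collections import defaultdict

# 1. Source: the app backend writes one row (hypothetical field names).
row = {"order_id": "o-1", "amount_usd": 48.90, "country": "BR"}

# 2. Log: an append-only list of serialized bytes stands in for Kafka.
log = []
log.append(json.dumps(row).encode("utf-8"))

# 3. Processing: decode each record and enrich it with integer cents.
def process(record: bytes) -> dict:
    event = json.loads(record)
    event["amount_cents"] = round(event["amount_usd"] * 100)
    return event

# 4. Storage: land the processed events column by column.
columns = defaultdict(list)
for record in log:
    for key, value in process(record).items():
        columns[key].append(value)

# 5. Serving: answer "revenue per country" by scanning only two columns.
revenue_by_country = defaultdict(int)
for country, cents in zip(columns["country"], columns["amount_cents"]):
    revenue_by_country[country] += cents

# 6. Consumption: the line a dashboard would finally render.
dashboard_line = f"BR revenue: ${revenue_by_country['BR'] / 100:.2f}"
print(dashboard_line)  # BR revenue: $48.90
```

Note how the shape changes at every hop: a row, then bytes, then an enriched event, then columns, then an aggregate, then a sentence a human reads.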

What each layer is actually for

Each layer exists because the layer upstream of it has the wrong shape for the layer downstream of it. That sentence is the entire field, compressed.

  1. Source. Where bytes are born. App backends writing to Postgres, mobile clients firing analytics events, IoT sensors, third-party webhooks. The schema here is whatever the engineer who shipped the feature picked on a Tuesday. Optimised for: write throughput, transactional correctness.
  2. Log. An ordered, durable, partitioned, append-only record of everything that happened. Kafka, Kinesis, Pub/Sub. The cleanest abstraction in the whole stack — it decouples producers from consumers in time and space. If your log is healthy, every other layer can be rebuilt. Optimised for: linear writes, replay, fan-out.
  3. Processing. Where shape changes. Filter, enrich, aggregate, join, window. Spark and Flink and Beam all live here. The hard distinction here is batch (jobs over bounded data) vs stream (jobs over unbounded data) — same algebra, different runtime.
  4. Storage. Bytes that survive the next deploy. Object stores (S3, GCS) for the bulk; open table formats (Iceberg, Delta) for ACID; warehouses (Snowflake, BigQuery) for managed query. Optimised for: cheap-per-GB, columnar scans, durability.
  5. Serving. Sub-second answers. OLAP engines for BI, vector stores for ML retrieval, key-value stores for online features, search indexes for "find me X." Different shapes for different read patterns.
  6. Consumption. Dashboards, alerts, ML model inputs, billing systems, fraud rules. The whole point. The reason any of the lower layers exist.

The interview move. When asked to "design X data system," start by drawing this six-layer skeleton on the board, then mark which layers the question is actually about. Most questions live in 2-3 layers; the others are obvious.
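
The log layer's properties (append-only, partitioned, replay, fan-out) are easier to hold onto with a toy in hand. The sketch below is a hypothetical in-memory `ToyLog`, not the Kafka, Kinesis, or Pub/Sub API: the class, its method names, and the byte-sum partitioner are all assumptions made for illustration.

```python
from collections import defaultdict


class ToyLog:
    """In-memory stand-in for a partitioned, append-only log (hypothetical)."""

    def __init__(self, num_partitions: int = 2):
        self.partitions = [[] for _ in range(num_partitions)]
        # Each consumer group keeps its own read offset per partition.
        self.offsets = defaultdict(lambda: [0] * num_partitions)

    def append(self, key: str, value: dict) -> tuple:
        # Deterministic partitioner: the same key always maps to the same
        # partition, which preserves per-key ordering.
        partition = sum(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append(value)
        return partition, len(self.partitions[partition]) - 1

    def poll(self, group: str) -> list:
        """Return this group's unread records and advance its offsets."""
        records = []
        for p, partition in enumerate(self.partitions):
            start = self.offsets[group][p]
            records.extend(partition[start:])
            self.offsets[group][p] = len(partition)
        return records

    def rewind(self, group: str) -> None:
        """Reset a group to the start of every partition (replay)."""
        self.offsets[group] = [0] * len(self.partitions)


log = ToyLog()
log.append("BR", {"order_id": "o-1", "amount_usd": 48.90})
log.append("BR", {"order_id": "o-2", "amount_usd": 12.00})

billing = log.poll("billing")   # fan-out: each group reads independently
fraud = log.poll("fraud")
caught_up = log.poll("billing")  # nothing new, so this is empty

log.rewind("billing")            # replay: records were never deleted
replayed = log.poll("billing")
```

Reading never removes anything from the log, which is why a second consumer group and a rewound one both see the full history. That is the property behind "if your log is healthy, every other layer can be rebuilt."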

Two forces shape every layer

Every data layer is the result of two forces fighting:

  • Latency vs throughput. Source systems are tuned for low-latency single-row writes. Storage layers are tuned for high-throughput columnar scans. The gap between them is why we need processing layers.
  • Schema-on-write vs schema-on-read. Postgres demands you declare the schema before you can insert. A data lake lets you dump JSON now and figure out the columns later. Both have a cost; the cost is just paid at different times.
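
A small sketch can make the second force concrete. It uses Python's built-in `sqlite3` as a stand-in for Postgres and a list of JSON strings as a stand-in for a data lake; the table, the columns, and the `coupon` field are hypothetical. The point is only to show where the cost lands, not to model either system faithfully.

```python
import json
import sqlite3

# Schema-on-write: the table must exist, with its columns, before any insert.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (order_id TEXT NOT NULL, amount_usd REAL NOT NULL)")
db.execute("INSERT INTO orders (order_id, amount_usd) VALUES (?, ?)", ("o-1", 48.90))

try:
    # A field nobody declared is rejected at write time.
    db.execute(
        "INSERT INTO orders (order_id, amount_usd, coupon) VALUES (?, ?, ?)",
        ("o-2", 12.0, "WELCOME"),
    )
    write_rejected = False
except sqlite3.OperationalError:
    write_rejected = True

stored_rows = db.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

# Schema-on-read: dump raw JSON now, decide on the columns when reading.
lake = [
    json.dumps({"order_id": "o-1", "amount_usd": 48.90}),
    json.dumps({"order_id": "o-2", "amount_usd": 12.0, "coupon": "WELCOME"}),
]


def read_orders(raw_lines):
    """Apply a schema at read time; the reader handles missing fields."""
    for line in raw_lines:
        record = json.loads(line)
        yield record["order_id"], float(record["amount_usd"]), record.get("coupon")


rows = list(read_orders(lake))
```

The database refuses the second order until someone migrates the table; the lake accepts it immediately, and every reader afterwards has to cope with `coupon` being absent on older records.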

When someone asks "should this be a stream or a batch job?" — they're asking which side of the latency/throughput spectrum the answer needs to be on. When someone asks "should we ETL or ELT?" — they're asking when you want to pay the schema tax.
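
To see the stream-or-batch question in miniature, here is a hedged sketch of the same aggregation written both ways. The order tuples and both function names are made up for illustration, and a plain Python iterator stands in for an unbounded source; a real engine such as Spark or Flink adds windowing, state management, and fault tolerance on top.

```python
from collections import defaultdict
from typing import Iterable, Iterator

# (country, amount in cents); hypothetical sample data.
orders = [("BR", 4890), ("US", 1200), ("BR", 1000)]


def batch_revenue(bounded: list) -> dict:
    """Batch: read all the bounded input, then emit one final answer."""
    totals = defaultdict(int)
    for country, cents in bounded:
        totals[country] += cents
    return dict(totals)


def stream_revenue(unbounded: Iterable) -> Iterator[dict]:
    """Stream: emit an updated answer after every event, never 'finishing'."""
    totals = defaultdict(int)
    for country, cents in unbounded:
        totals[country] += cents
        yield dict(totals)


batch_result = batch_revenue(orders)
stream_snapshots = list(stream_revenue(iter(orders)))

print(batch_result)          # {'BR': 5890, 'US': 1200}
print(stream_snapshots[0])   # {'BR': 4890}
```

Both functions apply the same sum-by-key logic. The batch version waits for all input and answers once; the stream version has an answer after the first event, and its last snapshot matches the batch result.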

The vocab you actually need

You'll hear these words for the rest of the course. Click each card.