lesson 04 · track 02 — storage

Row vs column,
and inside Parquet.

Why columnar wins for analytics. A Parquet file dissected — encodings, row groups, statistics, predicate pushdown. The exact thing your warehouse is doing under the hood.

Learning objectives
  • Explain why columnar layout reduces IO for analytical queries compared to row-oriented storage
  • Describe the Parquet file structure: magic bytes, row groups, column chunks, pages, footer
  • Name the five Parquet encodings and state which data types each suits best
  • Explain predicate pushdown and Bloom filters as complementary row-group pruning mechanisms
13 min 4 simulations +90 XP available

Almost every analytical query is shaped like: "give me a few columns, filtered, aggregated, over many rows." The query engine has to read bytes from disk to answer it. The crucial question — and almost the entire reason columnar storage exists — is: how many of those bytes are useful?

In a row-oriented store, all of a row's columns are physically next to each other. To get one column from a billion rows, you read all the other columns too. The disk doesn't know which bytes you wanted; the OS just hands them over.

In a column-oriented store, all values of one column are physically next to each other. You read only the columns you asked for. For a typical analytical query that touches 4 of 200 columns, that's a 50× IO reduction. And once values are co-located by column, they compress vastly better, because they're all the same type — usually with very low cardinality.

The same query, two layouts

Anatomy of a Parquet file

Parquet is the de-facto columnar format for analytics on object storage. It's a binary file, but its structure is the most useful thing in this entire course. Memorize this layout.

Fig. 01 · parquet on disk A file is row groups stacked, with a footer index at the end left → right = byte order
PAR1 · magic header ROW GROUP 0 ~128 MB · ~1M rows order_id bigint page 0 PLAIN · 1 MB page 1 PLAIN · 1 MB page 2 PLAIN · 1 MB country string page 0 DICT · 64 KB page 1 RLE · 12 KB page 2 RLE · 12 KB amount decimal page 0 PLAIN · 1 MB page 1 PLAIN · 1 MB page 2 PLAIN · 1 MB ts timestamp page 0 DELTA · 24 KB page 1 DELTA · 24 KB page 2 DELTA · 24 KB ROW GROUP 1 same shape · 4 columns × N pages …repeats for every row group in the file… FOOTER read first · the index of the file ├─schema— Thrift-encoded column types ├─row group metadata— byte offsets, sizes ├─column chunk metadata— min/max, null_count, encodings └─key/value user metadata PAR1 · magic footer column chunk all values of one column, contiguous pages smallest unit of decompression · ~1 MB row group ~128 MB of rows prunable as a unit footer where readers seek first · the index byte 0 EOF file order
The whole format. Bytes from top to bottom: PAR1, then row groups, then a footer that indexes everything, then PAR1 again. Readers seek to the end first, parse the footer to learn schema and per-column-chunk min/max, then jump to just the pages they need. Three columns ignored, one column read, one row group skipped — that's a 50× IO reduction without ever touching the bulk of the file.

Three things make this format fast:

  1. Columnar layout. A query for SUM(amount) reads the amount column chunk in each row group, and nothing else.
  2. Per-column encoding. Run-length encoding for repeated values, dictionary encoding for low-cardinality strings (country = US, US, US, UK becomes ints 0, 0, 0, 1 + a tiny dict). Often 5-10× compression on top of GZIP/Snappy.
  3. Min/max statistics per row group, per column. If a query says WHERE amount > 1000 and a row group's max amount is 50, the engine skips the entire row group without reading a single byte from it. This is called predicate pushdown, and it's the main reason analytical queries can hit a petabyte and return in 200ms.

That's it. That's the whole format. ORC (Optimized Row Columnar) is Parquet's closest sibling — it's also columnar, born in Hive, and shares the same core ideas: column chunks, per-chunk statistics, and footer-first reading. Avro is the row-oriented complement: schema-embedded, great for streaming and Kafka payloads, bad for analytics. The pattern is almost always Avro on the wire, Parquet at rest.

Encodings, dissected

EncodingBest forExample
PlainHigh-entropy data; floats, hashes.Just the raw bytes.
DictionaryLow-cardinality strings."US"→0, "UK"→1 + index list.
RLELong runs of repeats.[0,0,0,0,1,1] → [(0,4),(1,2)].
Bit-packingSmall integers.Pack 8 booleans into a byte.
Delta encodingSorted columns (timestamps, auto-increment IDs).Store diffs from prev value, not absolute values. Unrelated to Delta Lake.
The interview move. When asked to compute storage cost or query latency, separate logical row count × bytes-per-row from physical bytes after encoding + compression. They differ by 5-20× for typical analytical data. A 1TB CSV is usually 50-200GB as Parquet.

Parquet vs Iceberg vs Delta Lake

A common confusion: Parquet, Iceberg, and Delta Lake are not competing formats. They operate at different levels.

  • Parquet is a file format. It defines how bytes are laid out within a single file: row groups, column chunks, encodings, footer stats. It knows nothing about transactions, schema evolution across files, or which files make up a "table."
  • Apache Iceberg and Delta Lake are open table formats (also called lakehouse table formats). They sit one layer above Parquet: they manage a collection of Parquet (or ORC) files as a logical table and add the things Parquet lacks: ACID transactions, snapshot isolation, time-travel queries, schema and partition evolution, and file-level manifests that enable efficient planning.

The key architectural differences between Iceberg and Delta Lake:

  • Metadata model. Iceberg uses a tree of JSON manifest files — a snapshot points to one or more manifest lists, each of which points to individual Parquet files. Delta uses a transaction log (_delta_log/) of JSON commit files with sequential numbering. Both support time travel; Iceberg's tree is better for large-scale concurrent writes; Delta's linear log is simpler to reason about for single-writer workloads.
  • Partition evolution. Iceberg supports partition spec changes without rewriting data — old partitions stay on the old spec, new writes use the new spec, and the engine handles both transparently. Delta Lake requires a full rewrite to change partition layout.
  • Hidden partitioning. Iceberg can partition by a transform of a column (e.g., months(order_date)) and hide that from the query writer — queries filter by order_date and the engine maps to partition files automatically. Delta Lake uses physical partition directories (Hive-style), so the partition column must appear in the query.
  • Ecosystem. Delta Lake is tightly integrated with Databricks and Spark; Iceberg has broader engine support (Flink, Trino, Snowflake, Athena, Spark) and is the preferred default for multi-engine architectures.
The interview move. If asked "should we use Iceberg or Delta?", the honest answer is: Delta wins if you're a Databricks shop (better tooling, Liquid Clustering, DML optimisations). Iceberg wins if you need engine portability or partition evolution. Both are vastly better than bare Parquet files with Hive-style directories.

Bloom filters: a free win

Min/max stats only help with range predicates. For point lookups — "WHERE user_id = 'abc-123'" — min/max is useless because almost every row group will contain that range. The fix: a Bloom filter per column chunk. A small probabilistic data structure that can answer "is X definitely not here?" with zero false negatives. False positives are tunable — usually 1%.

Quick check

Vocab