Almost every analytical query is shaped like: "give me a few columns, filtered, aggregated, over many rows." The query engine has to read bytes from disk to answer it. The crucial question — and almost the entire reason columnar storage exists — is: how many of those bytes are useful?
In a row-oriented store, all of a row's columns are physically next to each other. To get one column from a billion rows, you read all the other columns too. The disk doesn't know which bytes you wanted; the OS just hands them over.
In a column-oriented store, all values of one column are physically next to each other. You read only the columns you asked for. For a typical analytical query that touches 4 of 200 columns, that's a 50× IO reduction. And once values are co-located by column, they compress vastly better, because they're all the same type — usually with very low cardinality.
The same query, two layouts
Anatomy of a Parquet file
Parquet is the de-facto columnar format for analytics on object storage. It's a binary file, but its structure is the most useful thing in this entire course. Memorize this layout.
PAR1, then row groups, then a footer that indexes everything, then PAR1 again. Readers seek to the end first, parse the footer to learn schema and per-column-chunk min/max, then jump to just the pages they need. Three columns ignored, one column read, one row group skipped — that's a 50× IO reduction without ever touching the bulk of the file.
Three things make this format fast:
- Columnar layout. A query for
SUM(amount)reads the amount column chunk in each row group, and nothing else. - Per-column encoding. Run-length encoding for repeated values, dictionary encoding for low-cardinality strings (
country= US, US, US, UK becomes ints 0, 0, 0, 1 + a tiny dict). Often 5-10× compression on top of GZIP/Snappy. - Min/max statistics per row group, per column. If a query says
WHERE amount > 1000and a row group's maxamountis 50, the engine skips the entire row group without reading a single byte from it. This is called predicate pushdown, and it's the main reason analytical queries can hit a petabyte and return in 200ms.
That's it. That's the whole format. ORC (Optimized Row Columnar) is Parquet's closest sibling — it's also columnar, born in Hive, and shares the same core ideas: column chunks, per-chunk statistics, and footer-first reading. Avro is the row-oriented complement: schema-embedded, great for streaming and Kafka payloads, bad for analytics. The pattern is almost always Avro on the wire, Parquet at rest.
Encodings, dissected
| Encoding | Best for | Example |
|---|---|---|
| Plain | High-entropy data; floats, hashes. | Just the raw bytes. |
| Dictionary | Low-cardinality strings. | "US"→0, "UK"→1 + index list. |
| RLE | Long runs of repeats. | [0,0,0,0,1,1] → [(0,4),(1,2)]. |
| Bit-packing | Small integers. | Pack 8 booleans into a byte. |
| Delta encoding | Sorted columns (timestamps, auto-increment IDs). | Store diffs from prev value, not absolute values. Unrelated to Delta Lake. |
Parquet vs Iceberg vs Delta Lake
A common confusion: Parquet, Iceberg, and Delta Lake are not competing formats. They operate at different levels.
- Parquet is a file format. It defines how bytes are laid out within a single file: row groups, column chunks, encodings, footer stats. It knows nothing about transactions, schema evolution across files, or which files make up a "table."
- Apache Iceberg and Delta Lake are open table formats (also called lakehouse table formats). They sit one layer above Parquet: they manage a collection of Parquet (or ORC) files as a logical table and add the things Parquet lacks: ACID transactions, snapshot isolation, time-travel queries, schema and partition evolution, and file-level manifests that enable efficient planning.
The key architectural differences between Iceberg and Delta Lake:
- Metadata model. Iceberg uses a tree of JSON manifest files — a snapshot points to one or more manifest lists, each of which points to individual Parquet files. Delta uses a transaction log (
_delta_log/) of JSON commit files with sequential numbering. Both support time travel; Iceberg's tree is better for large-scale concurrent writes; Delta's linear log is simpler to reason about for single-writer workloads. - Partition evolution. Iceberg supports partition spec changes without rewriting data — old partitions stay on the old spec, new writes use the new spec, and the engine handles both transparently. Delta Lake requires a full rewrite to change partition layout.
- Hidden partitioning. Iceberg can partition by a transform of a column (e.g.,
months(order_date)) and hide that from the query writer — queries filter byorder_dateand the engine maps to partition files automatically. Delta Lake uses physical partition directories (Hive-style), so the partition column must appear in the query. - Ecosystem. Delta Lake is tightly integrated with Databricks and Spark; Iceberg has broader engine support (Flink, Trino, Snowflake, Athena, Spark) and is the preferred default for multi-engine architectures.
Bloom filters: a free win
Min/max stats only help with range predicates. For point lookups — "WHERE user_id = 'abc-123'" — min/max is useless because almost every row group will contain that range. The fix: a Bloom filter per column chunk. A small probabilistic data structure that can answer "is X definitely not here?" with zero false negatives. False positives are tunable — usually 1%.