04 · Row vs Column: Inside Parquet

Almost every analytical query is shaped like: "give me a few columns, filtered, aggregated, over many rows." The query engine has to read bytes from disk to answer it. The crucial question — and almost the entire reason columnar storage exists — is: how many of those bytes are useful?

In a row-oriented store, all of a row's columns are physically next to each other. To get one column from a billion rows, you read all the other columns too. The disk doesn't know which bytes you wanted; the OS just hands them over.

In a column-oriented store, all values of one column are physically next to each other. You read only the columns you asked for. For a typical analytical query that touches 4 of 200 columns, that's a 50× IO reduction. And once values are co-located by column, they compress vastly better, because they're all the same type — usually with very low cardinality.

The same query, two layouts

Anatomy of a Parquet file

Parquet is the de-facto columnar format for analytics on object storage. It's a binary file, but its structure is the most useful thing in this entire course. Memorize this layout.

Fig. 01 · parquet on disk A file is row groups stacked, with a footer index at the end left → right = byte order

The whole format. Bytes from top to bottom: PAR1, then row groups, then a footer that indexes everything, then PAR1 again. Readers seek to the end first, parse the footer to learn schema and per-column-chunk min/max, then jump to just the pages they need. Three columns ignored, one column read, one row group skipped — that's a 50× IO reduction without ever touching the bulk of the file.

Three things make this format fast:

Columnar layout. A query for SUM(amount) reads the amount column chunk in each row group, and nothing else.
Per-column encoding. Run-length encoding for repeated values, dictionary encoding for low-cardinality strings (country = US, US, US, UK becomes ints 0, 0, 0, 1 + a tiny dict). Often 5-10× compression on top of GZIP/Snappy.
Min/max statistics per row group, per column. If a query says WHERE amount > 1000 and a row group's max amount is 50, the engine skips the entire row group without reading a single byte from it. This is called predicate pushdown, and it's the main reason analytical queries can hit a petabyte and return in 200ms.

That's it. That's the whole format. ORC (Optimized Row Columnar) is Parquet's closest sibling — it's also columnar, born in Hive, and shares the same core ideas: column chunks, per-chunk statistics, and footer-first reading. Avro is the row-oriented complement: schema-embedded, great for streaming and Kafka payloads, bad for analytics. The pattern is almost always Avro on the wire, Parquet at rest.

Encodings, dissected

Encoding	Best for	Example
Plain	High-entropy data; floats, hashes.	Just the raw bytes.
Dictionary	Low-cardinality strings.	"US"→0, "UK"→1 + index list.
RLE	Long runs of repeats.	[0,0,0,0,1,1] → [(0,4),(1,2)].
Bit-packing	Small integers.	Pack 8 booleans into a byte.
Delta encoding	Sorted columns (timestamps, auto-increment IDs).	Store diffs from prev value, not absolute values. Unrelated to Delta Lake.

▸

The interview move. When asked to compute storage cost or query latency, separate logical row count × bytes-per-row from physical bytes after encoding + compression. They differ by 5-20× for typical analytical data. A 1TB CSV is usually 50-200GB as Parquet.

Parquet vs Iceberg vs Delta Lake

A common confusion: Parquet, Iceberg, and Delta Lake are not competing formats. They operate at different levels.

Parquet is a file format. It defines how bytes are laid out within a single file: row groups, column chunks, encodings, footer stats. It knows nothing about transactions, schema evolution across files, or which files make up a "table."
Apache Iceberg and Delta Lake are open table formats (also called lakehouse table formats). They sit one layer above Parquet: they manage a collection of Parquet (or ORC) files as a logical table and add the things Parquet lacks: ACID transactions, snapshot isolation, time-travel queries, schema and partition evolution, and file-level manifests that enable efficient planning.

The key architectural differences between Iceberg and Delta Lake:

Metadata model. Iceberg uses a tree of JSON manifest files — a snapshot points to one or more manifest lists, each of which points to individual Parquet files. Delta uses a transaction log (_delta_log/) of JSON commit files with sequential numbering. Both support time travel; Iceberg's tree is better for large-scale concurrent writes; Delta's linear log is simpler to reason about for single-writer workloads.
Partition evolution. Iceberg supports partition spec changes without rewriting data — old partitions stay on the old spec, new writes use the new spec, and the engine handles both transparently. Delta Lake requires a full rewrite to change partition layout.
Hidden partitioning. Iceberg can partition by a transform of a column (e.g., months(order_date)) and hide that from the query writer — queries filter by order_date and the engine maps to partition files automatically. Delta Lake uses physical partition directories (Hive-style), so the partition column must appear in the query.
Ecosystem. Delta Lake is tightly integrated with Databricks and Spark; Iceberg has broader engine support (Flink, Trino, Snowflake, Athena, Spark) and is the preferred default for multi-engine architectures.

▸

The interview move. If asked "should we use Iceberg or Delta?", the honest answer is: Delta wins if you're a Databricks shop (better tooling, Liquid Clustering, DML optimisations). Iceberg wins if you need engine portability or partition evolution. Both are vastly better than bare Parquet files with Hive-style directories.

Bloom filters: a free win

Min/max stats only help with range predicates. For point lookups — "WHERE user_id = 'abc-123'" — min/max is useless because almost every row group will contain that range. The fix: a Bloom filter per column chunk. A small probabilistic data structure that can answer "is X definitely not here?" with zero false negatives. False positives are tunable — usually 1%.

Row vs column,
and inside Parquet.

The same query, two layouts

Anatomy of a Parquet file

Encodings, dissected

Parquet vs Iceberg vs Delta Lake

Bloom filters: a free win

Quick check

Vocab

Row vs column,and inside Parquet.

The same query, two layouts

Anatomy of a Parquet file

Encodings, dissected

Parquet vs Iceberg vs Delta Lake

Bloom filters: a free win

Quick check

Vocab

Row vs column,
and inside Parquet.