col 0 · col 25 · col 50 · col 75 · col 99

DISK · {mode === 'row' ? 'row-oriented (CSV / OLTP)' : 'columnar (Parquet / ORC)'} target: col[47] revenue

{grid.map((v, i) => { const col = i % COLS; const isTarget = col === TARGET_COL; const cls = [ 'sc-c', v === 1 ? 'head' : v === 2 ? 'read' : '', isTarget ? 'target' : '', mode === 'col' && !isTarget ? 'dark' : '', mode === 'col' && isTarget && snappy && v === 2 ? 'snappy' : '', ].filter(Boolean).join(' '); return ; })}

{/* Progress gauge */}

scan progress

{Math.round(progress * 100)}%

{/* Readouts */}

bytes scanned

{bytesScanned.toFixed(2)} GB

of {bytesTotal} GB on disk

columns read

{reads.colsRead} / 100

{mode === 'row' ? 'row layout forces full scan' : 'projection pushdown'}

efficiency

{mode === 'row' ? '1×' : `${Math.round(COLS / (snappy ? 0.28 : 1))}×`}

{mode === 'row' ? 'baseline' : `${Math.round((1 - (snappy?0.28:1)/COLS) * 100)}% of disk skipped`}

scan time

{scanTime.toFixed(2)} s

at 1 GB/s

{/* Controls */}

setSnappy(e.target.checked)} /> Snappy compression shrinks column stripe ~3.5×

); } /* ============================================================ * Format spectrum strip — CSV → Parquet → Iceberg * ============================================================ */ function FormatSpectrum() { const formats = [ { name: 'CSV / JSON', kind: 'row', tagline: 'Human-readable. No schema. No types. No compression. Fine for hand-off, terrible for analytics.', traits: ['row', 'no-schema', 'uncompressed'] }, { name: 'Parquet / ORC', kind: 'col', tagline: 'Columnar on disk. Schema + types embedded. Snappy/ZSTD. The analytical default.', traits: ['columnar', 'schema', 'compressed'] }, { name: 'Iceberg / Delta / Hudi', kind: 'tbl', tagline: 'A table format on top of Parquet: metadata manifests that give you ACID, schema evolution, time travel.', traits: ['ACID', 'time-travel', 'schema-evolution'] }, ]; return (

{formats.map((f, i) => (

0{i+1}

{f.name}

{f.tagline}

{f.traits.map(t => {t})}

))}

); } /* ============================================================ * Lakehouse vs monolith diagram — static but polished * ============================================================ */ function LakehouseDiagram() { return (

Legacy · coupled

Oracle · Teradata · on-prem MPP

One box. Compute tied to its own disks. Scale one, scale both. Upgrade = migration.

DECOUPLE →

Modern · lakehouse

Compute (elastic)

Presto · Spark · Trino

reads

Storage (cheap, shared)

Parquet · ORC · HDFS · S3

Many engines read the same bytes. Compute spins up per-query, storage costs cents.

); } /* ============================================================ * Engine comparison — Presto / Spark / Snowflake cards * ============================================================ */ function EngineCards() { const engines = [ { n: 'Presto / Trino', kind: 'MPP, in-memory', fits: 'Interactive dashboards. Sub-second to tens of seconds.', not: 'Hour-long ETL jobs: it dies, can\'t retry.', icon: }, { n: 'Spark / Databricks', kind: 'distributed, fault-tolerant', fits: 'Heavy ETL. Big joins. Anything that must finish.', not: 'Quick ad-hoc: the JVM spin-up alone eats your latency.', icon: }, { n: 'Snowflake', kind: 'cloud DW → lakehouse', fits: 'Managed. Zero-ops. Good price/perf on mid-scale.', not: 'Anywhere you need to read external Parquet from a non-Snowflake engine.', icon: }, ]; return (

{engines.map(e => (

{e.icon}

{e.n}

{e.kind}

Fits {e.fits}

Avoid {e.not}

))}

); } /* ============================================================ */ function Ch0_Fundamentals({ chapter, internalMode }) { return ( <> LakehouseRow vs columnarParquetIceberg' }, { k: 'Engines', v: 'Presto · Spark · Trino · Snowflake' }, { k: 'Outcome', v: 'Read 100× less disk per query' }, ]} /> {/* --- 0.1 Decoupling --- */}

Decoupling storage from compute

The quiet shift that changed every warehouse.

A decade ago, a warehouse was a box. Oracle, Teradata, Vertica: one appliance owned both the disks and the query engine. You bought them together, you scaled them together, and if you wanted to try a new engine you migrated terabytes first.

The lakehouse move was to put the bytes in a shared object store such as S3, GCS, or Azure Blob, as open columnar files (Parquet, ORC), and let any engine read them. Compute became a job, not a server. Storage became a commodity.

{/* --- §1 The Layers --- */}

The layers

Seven layers, one query.

A warehouse query touches seven layers. Most engineers only think about two, the SQL they wrote and the table they named, and are baffled when things break in between. The stack, bottom-up: physical storage (SSD blob tier), blob (S3), file format (Parquet · ORC · Avro), table abstraction (namespaces → tables → partitions), catalog (Glue Catalog), query engine (Presto · Spark), application (Hex · dashboards). Knowing the layer means knowing the failure mode.

{/* --- §2 Byte trace --- */}

A byte's journey

From SELECT to flash tier, and back.

Let's make storage tangible. Here's a single byte, the value of user_email for one row, traced through every stop from the SQL statement to the physical bytes on disk. Cold and warm caches have wildly different latency profiles; metastore and blob lookups are the two stops that dominate a cold run.

{/* --- Row vs columnar (flagship sim) --- */}

Row vs columnar, visualized

Why analytics loves columns.

In a row layout every record's fields are stored together: perfect for "fetch user 42" but catastrophic for "average revenue across a billion rows". The scanner has no choice but to touch every byte just to find the one column you asked for.

Columnar flips it: all values of revenue are stored contiguously on disk. When a query needs one column out of a hundred, the engine can skip 99% of the table and go straight to the column it needs. This is called projection pushdown, and it's the single biggest reason Parquet is the analytical default.

Columnar storage compresses beautifully because values in one column are homogeneous: a column of timestamps, a column of country codes. Snappy, ZSTD, and run-length encoding routinely shrink a stripe 3–10×. The scan head has less to read and the bytes it reads unpack cheaply.
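The arithmetic behind the simulation can be sketched in a few lines. This is a back-of-envelope model, not a benchmark: the 100 columns, the 0.28 Snappy ratio, and the 1 GB/s scan speed come from the simulation, while the 100 GB table size and the equal-width columns are assumptions made here for illustration.

```javascript
// Assumptions (not from the text): a 100 GB table, all columns equally wide.
const COLS = 100;
const TABLE_GB = 100;          // hypothetical table size on disk
const SNAPPY_RATIO = 0.28;     // compressed size / raw size (about 3.5x smaller)
const THROUGHPUT_GB_PER_S = 1; // scan speed

function scanCost(mode, snappy) {
  // Row layout: every byte of every record passes under the scan head.
  if (mode === 'row') {
    return { gbScanned: TABLE_GB, seconds: TABLE_GB / THROUGHPUT_GB_PER_S, colsRead: COLS };
  }
  // Columnar layout: read only the target column's stripe, optionally compressed.
  const stripeGb = (TABLE_GB / COLS) * (snappy ? SNAPPY_RATIO : 1);
  return { gbScanned: stripeGb, seconds: stripeGb / THROUGHPUT_GB_PER_S, colsRead: 1 };
}

const rowScan = scanCost('row', false);
const colScan = scanCost('col', false);
const colSnappyScan = scanCost('col', true);

console.log(rowScan.gbScanned, colScan.gbScanned, colSnappyScan.gbScanned);
console.log(Math.round(rowScan.gbScanned / colSnappyScan.gbScanned) + 'x less disk read');
```

Under these assumptions the row scan reads the whole table, the columnar scan reads one stripe, and compression shrinks that stripe again. Real tables have columns of very different widths, so treat the ratio as an upper-bound intuition rather than a prediction.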

{/* --- Format spectrum --- */}

The file-format spectrum

From CSV to Iceberg.

There's a layered vocabulary worth getting right. File format is how bytes sit on disk. Table format is a catalog of files that makes them behave like a table: transactional, evolvable, time-travelable.

You rarely pick just one. A modern pipeline lands raw JSON, converts to Parquet at ingest, and registers the Parquet in an Iceberg table so SELECT ... FOR VERSION AS OF works and a bad backfill is one SQL statement away from being rolled back.
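To see why a table format makes rollback cheap, here is a toy model of the idea: a table is a history of snapshots, each snapshot is just a list of immutable data files, and "the table" is a pointer to one of them. This is a sketch only. Real table formats keep this metadata in manifest files with far more detail, and every name below (ToyTable, commit, filesAsOf, rollbackTo) is hypothetical.

```javascript
// Toy model of a table format; not the actual metadata layout of any real format.
class ToyTable {
  constructor() {
    this.snapshots = [{ id: 0, files: [] }]; // append-only history of file lists
    this.current = 0;                        // pointer to the live snapshot
  }
  // A commit never edits old files; it records a new list of files.
  commit(addFiles, removeFiles = []) {
    const live = this.snapshots[this.current].files;
    const files = live.filter(f => !removeFiles.includes(f)).concat(addFiles);
    const id = this.snapshots.length;
    this.snapshots.push({ id, files });
    this.current = id; // the atomic step: swap one pointer
    return id;
  }
  // Time travel: the file list as it was at an earlier snapshot.
  filesAsOf(id) { return [...this.snapshots[id].files]; }
  files() { return this.filesAsOf(this.current); }
  // Rollback: move the pointer back; no data file is rewritten.
  rollbackTo(id) { this.current = id; }
}

const toyTable = new ToyTable();
const snapDay1 = toyTable.commit(['day1.parquet']);
const snapDay2 = toyTable.commit(['day2.parquet']);
const snapBad = toyTable.commit(['bad_backfill.parquet'], ['day1.parquet']);

toyTable.rollbackTo(snapDay2); // undo the bad backfill
console.log(toyTable.files());
console.log(toyTable.filesAsOf(snapBad)); // the bad version is still readable for debugging
```

The design point is that data files are never modified in place. Commits, time travel, and rollback all reduce to choosing which file list the readers see.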

{/* --- §3 SQL decoder + stage visualizer --- */}

How a query becomes work

Five transformations between your text and your bytes.

New hires think SQL "just runs." In fact a coordinator takes your statement through a pipeline: a parser builds an AST, an analyzer resolves names against the catalog, a planner emits a logical tree of relational operators, then a physical plan with exchange types and worker counts, and finally a task graph of stages dispatched across the cluster. Every step is inspectable via EXPLAIN ANALYZE.
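The five steps can be made concrete with a deliberately tiny pipeline that only understands queries of the form SELECT a, b FROM t. It is a sketch of the shape of the work, not of how any real coordinator is implemented: the catalog contents, the worker count, the GATHER exchange label, and all function names are assumptions made for illustration.

```javascript
// Hypothetical catalog: one table, three columns, three data files.
const toyCatalog = {
  orders: { columns: ['id', 'revenue', 'country'], files: ['f1', 'f2', 'f3'] },
};

// 1. Parser: text -> AST.
function parse(sql) {
  const m = /^\s*SELECT\s+(.+?)\s+FROM\s+(\w+)\s*;?\s*$/i.exec(sql);
  if (!m) throw new Error('syntax error');
  return { type: 'Select', columns: m[1].split(',').map(c => c.trim()), table: m[2] };
}

// 2. Analyzer: resolve table and column names against the catalog.
function analyze(ast, catalog) {
  const table = catalog[ast.table];
  if (!table) throw new Error(`unknown table ${ast.table}`);
  for (const c of ast.columns) {
    if (!table.columns.includes(c)) throw new Error(`unknown column ${c}`);
  }
  return { ...ast, files: table.files };
}

// 3. Logical plan: a tree of relational operators.
function planLogical(resolved) {
  return {
    op: 'Project',
    columns: resolved.columns,
    input: { op: 'TableScan', table: resolved.table, files: resolved.files },
  };
}

// 4. Physical plan: attach an exchange type and a worker count.
function planPhysical(logical, workers) {
  return { ...logical, exchange: 'GATHER', workers };
}

// 5. Task graph: one scan task per file, spread round-robin over workers,
//    feeding a single output stage.
function toStages(physical) {
  const scanTasks = physical.input.files.map((file, i) => ({ file, worker: i % physical.workers }));
  return [{ stage: 1, tasks: scanTasks }, { stage: 0, tasks: [{ gatherFrom: 1 }] }];
}

const toyAst = parse('SELECT revenue FROM orders');
const toyStages = toStages(planPhysical(planLogical(analyze(toyAst, toyCatalog)), 2));
console.log(JSON.stringify(toyStages, null, 2));
```

Each function's output is plain data, which is the property that makes a real plan inspectable: you can stop after any step and look at what the next one will receive.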

{/* --- Engine ecosystem --- */}

The engine ecosystem

Pick the engine for the query, not the other way round.

Decoupled storage means you can run different engines against the same bytes depending on what you're doing. Interactive dashboards want sub-second response; hour-long ETL wants fault tolerance. One engine is rarely best at both.

{/* --- §4 Connectors --- */}

Connectors: same SQL, different physics

The connector chooses the physics.

Trino (the Presto fork) ships a pluggable connector interface: the same SQL statement can compile down to fanning out across a thousand S3 blobs, or reading a few megabytes from local SSD, or answering straight from coordinator memory. Latency can vary by orders of magnitude (1000× or more) with no change to the query text.
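A minimal sketch of the idea, assuming a made-up connector interface: the engine-side function never changes, and only the connector passed to it decides where the rows come from and how long that takes. This is not Trino's actual connector SPI. The makeConnector and sumColumn helpers are hypothetical, and the latency figures are placeholders chosen only to show the spread.

```javascript
// The same logical data, reachable through three different "physics".
const toyRows = [{ revenue: 10 }, { revenue: 20 }, { revenue: 30 }];

// Hypothetical connector shape: a name, a scan function, and a rough latency.
function makeConnector(name, latencyMs) {
  return {
    name,
    latencyMs, // placeholder figure, not a measurement
    // Return only the requested columns of the requested table.
    scan: (table, columns) =>
      toyRows.map(r => Object.fromEntries(columns.map(c => [c, r[c]]))),
  };
}

const toyConnectors = [
  makeConnector('object-store', 5000), // fan out over many remote blobs
  makeConnector('local-ssd', 5),       // a few megabytes from local disk
  makeConnector('memory', 0.005),      // straight from coordinator memory
];

// Engine side: identical for every connector.
function sumColumn(connector, table, column) {
  const rows = connector.scan(table, [column]);
  const total = rows.reduce((acc, r) => acc + r[column], 0);
  return { via: connector.name, total, latencyMs: connector.latencyMs };
}

const toyResults = toyConnectors.map(c => sumColumn(c, 'orders', 'revenue'));
for (const r of toyResults) {
  console.log(`${r.via}: total=${r.total}, latency about ${r.latencyMs} ms`);
}
```

The answer is the same on every path; only the cost differs. That is why a slow query is sometimes a connector question rather than a SQL question.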

{/* --- Anti-patterns --- */} Treating a data lake like a relational DB. UPDATE one_row WHERE id = ... on raw Parquet rewrites an entire file. Use a table format (Iceberg/Delta) that supports row-level changes, or batch the update.", "The small-files problem. 10 000 × 1 MB Parquet files is worse than 10 × 1 GB: file-listing overhead, per-file footer reads, and task spin-up dominate. Compact on a schedule.", "Landing raw CSV in the warehouse. Types unknown, no column pruning, no compression. Always convert to Parquet at ingest.", "SELECT * on a 300-column fact table. Undoes everything columnar gave you. Ask for exactly the columns you need.", "Reading Trino docs and assuming they apply to Presto. The forks diverged around 2020: function names, connector behavior, and optimizer defaults all differ.", "Treating SQL as opaque magic. Every query has a plan, and the plan is inspectable. EXPLAIN ANALYZE before you tune anything.", "Choosing Spark for a job Presto would finish in seconds. Spark cold-start is 2–10× Presto's: the JVM warm-up alone eats any interactive budget.", ]} /> A warehouse is seven layers. Knowing the layer means knowing the failure mode: metastore down is not the same as SSD tier slow.", "SQL → AST → logical → physical → stages → tasks. Five transformations between your text and your bytes. All inspectable.", "The connector chooses the physics. Same SQL, 1000× latency range. Snowflake ≠ Redis-backed cache ≠ System tables.", "Columnar formats turn analytics into skip-most-of-the-disk operations. Table formats add ACID and time travel on top.", "Read the plan before you tune the query. Filter on partition and indexed columns first. Avoid SELECT *.", ]} /> ); } window.Ch0_Fundamentals = Ch0_Fundamentals;