Apr 10, 2026
Data lake formats like Apache Iceberg suffer from the small writes problem: every write produces a new file, causing degraded read performance, expensive compaction, and operational headaches.
DuckLake solves this with data inlining: it stores small batches directly in its catalog database and flushes them to Parquet on demand. The DuckLake team's benchmark showed data inlining delivering 926x faster queries and 105x faster ingestion than Apache Iceberg.
pg_ducklake supports data inlining too. And because pg_ducklake runs inside the catalog database itself, those inlined writes can bypass DuckDB and go directly to Postgres, without losing visibility or ACID guarantees.
TL;DR: We reproduced the streaming benchmark with pg_ducklake, and it inserts small batches 5.8x faster than DuckDB+DuckLake.
Streaming Benchmark
We use one partition of the ClickBench HITS dataset and stream 300,000 rows in batches of 10, matching the settings from the DuckLake blog. Three scenarios:
| Scenario | Insert path |
|---|---|
| pg_heap | Regular Postgres INSERT |
| pg_ducklake | Direct insert (UNNEST, bypasses SPI) |
| DuckDB + DuckLake | DuckDB with PG catalog |
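Each scenario issues the same workload: 30,000 batches of 10 rows. A sketch of the two Postgres-side insert shapes (table and column names are illustrative; the real table uses the full HITS schema):

```sql
-- pg_heap: a regular multi-row INSERT per batch
INSERT INTO hits_heap (watch_id, event_time) VALUES
    ($1, $2), ($3, $4) /* ... 10 rows per statement ... */;

-- pg_ducklake: the direct-insert path, one array parameter per column
INSERT INTO hits (watch_id, event_time)
SELECT UNNEST($1::BIGINT[]), UNNEST($2::TIMESTAMP[]);
```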
Both pg_ducklake and standalone DuckDB use the same resources: 12 GB memory, 8 threads.
Insert results:
| Scenario | Time (s) | Rows/s |
|---|---|---|
| pg_heap | 77.8 | 3,855 |
| pg_ducklake | 202.8 | 1,479 |
| DuckDB + DuckLake | 1,185.5 | 253 |
pg_ducklake inserts are 5.8x faster than standalone DuckDB+DuckLake. This makes sense: inlined data lives in Postgres catalog tables, and Postgres is well optimized for OLTP writes. The pg_heap baseline (3,855 rows/s) shows the ceiling: pg_ducklake is within 2.6x of raw heap inserts despite also maintaining DuckLake metadata.
Standalone DuckDB, on the other hand, takes a different path: it manages its own connection to the catalog Postgres, serializes data through pyarrow, and handles inlining internally.
After flushing to Parquet files, both pg_ducklake and standalone DuckDB end up with the same data distribution, so follow-up queries via DuckDB achieve the same analytical performance.
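On the standalone side, DuckLake exposes a call to flush pending inlined rows out to Parquet. A sketch, assuming the catalog is attached as `lake` (the catalog name is illustrative):

```sql
-- Flush inlined rows from the catalog database to Parquet files
CALL ducklake_flush_inlined_data('lake');
```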
The benchmark scripts are at `test/benchmark/data-inlining/`.
How direct insert works
In standalone DuckDB, inserting into a DuckLake table goes through DuckDB’s query engine, which then deparses values back into SQL and sends them to the catalog Postgres over a connection. The values get parsed and serialized multiple times on the round trip.
pg_ducklake skips all of that. When you execute batch inserts, the planner hook intercepts the statement before it reaches DuckDB and inserts directly into the inlined data table. The values are parsed once, and the rows go straight into PostgreSQL heap pages.
We currently catch only the `INSERT INTO ... SELECT UNNEST($1), UNNEST($2), ...` pattern, the most common shape in our CDC workflows. We plan to support more patterns, and we'd love to hear what your insert patterns look like. Reach out on GitHub.
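Concretely, the intercepted shape looks like this (column names are illustrative; each column arrives as one bound array parameter):

```sql
-- The pattern pg_ducklake's planner hook intercepts:
-- a 10-row batch travels as a handful of array parameters
-- rather than 10 separate value lists.
INSERT INTO hits (watch_id, event_time)
SELECT UNNEST($1::BIGINT[]), UNNEST($2::TIMESTAMP[]);
```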
Conclusion
pg_ducklake is built for production data workflows in the PostgreSQL ecosystem: CDC, event streams, ETL pipelines. It gives you lakehouse storage with the OLTP insert path that Postgres is known for.
For interactive analytics, you can connect a standalone DuckDB directly to the same PostgreSQL catalog. pg_ducklake handles the ingestion, DuckDB handles the queries.
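A sketch of the query side, assuming the catalog database is named `ducklake_catalog` and data files live under `data/` (connection details are illustrative):

```sql
INSTALL ducklake;
LOAD ducklake;

-- Attach the same Postgres catalog that pg_ducklake writes to
ATTACH 'ducklake:postgres:dbname=ducklake_catalog host=localhost'
    AS lake (DATA_PATH 'data/');

-- Analytical queries run in DuckDB over the same tables
SELECT count(*) FROM lake.hits;
```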