Mar 7, 2026
We built pg_ducklake about 3 months ago as a fork of pg_duckdb. The project brings DuckLake, a lakehouse format, into Postgres. At first, we intended to eventually merge our changes back into pg_duckdb. That didn’t work out. Here’s why, and what we did instead.
Why We Separated
pg_duckdb is a compute-layer extension. It manages a DuckDB instance in each Postgres process, translates Postgres SQL into DuckDB SQL, and more.
pg_ducklake is a storage-layer extension. It introduces a Postgres table access method (AM), views on DuckLake tables, a foreign data wrapper (FDW), and more. All of these need DuckDB to handle the actual storage operations, so we ended up modifying pg_duckdb in two main areas: (1) adding hooks into the DuckDB lifecycle, and (2) adding rules to the SQL translator.
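For a rough sense of that storage-layer surface, here is a hedged sketch of the kind of SQL these entry points expose. The access method and FDW names below are illustrative assumptions, not pg_ducklake's confirmed API:

```sql
-- Hypothetical sketch: the names "ducklake" and "ducklake_fdw"
-- are assumptions, not pg_ducklake's documented identifiers.

-- A table whose storage is handled by a DuckLake access method:
CREATE TABLE events (id bigint, payload jsonb) USING ducklake;

-- A foreign table reading an external DuckLake catalog via an FDW:
CREATE SERVER lake FOREIGN DATA WRAPPER ducklake_fdw;
CREATE FOREIGN TABLE remote_events (id bigint, payload jsonb) SERVER lake;
```

Every statement of this kind has to route its reads and writes through DuckDB, which is exactly why the extension needed those hooks into pg_duckdb's DuckDB lifecycle.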
But these modifications didn’t belong in pg_duckdb, and keeping them there caused real problems:
- pg_duckdb is now at v1.2 and rarely changes its interface. DuckLake is still at v0.4 and its interface changes often. Every DuckLake update risked touching stable pg_duckdb code, and we couldn’t keep up.
- Code agents couldn’t focus. They kept crossing into pg_duckdb internals, and it was hard to stop them.
- Users had to choose between our fork and upstream pg_duckdb. They couldn’t use both.
By separating, we can move fast on pg_ducklake without destabilizing pg_duckdb, and we get to work in a smaller, focused codebase. Users can install pg_ducklake on top of an existing pg_duckdb installation without touching it; new users simply install both. We still carry some modifications not yet contributed upstream, but our vision and promise is that, once we release 1.0, pg_ducklake will work cleanly with a version-aligned upstream pg_duckdb.
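Concretely, for users the promise is that setup stays purely additive. A sketch of what that should look like: `CREATE EXTENSION pg_duckdb` is pg_duckdb's documented entry point, and I'm assuming here that pg_ducklake follows the same naming convention:

```sql
-- Existing pg_duckdb users keep their installation as-is:
CREATE EXTENSION IF NOT EXISTS pg_duckdb;

-- ...and layer pg_ducklake on top without modifying it
-- (assumed extension name, following the usual convention):
CREATE EXTENSION pg_ducklake;
```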
The relationship becomes cleaner:
pg_ducklake - pg_duckdb = ducklake - duckdb
pg_ducklake extends pg_duckdb, just like DuckLake is an extension to DuckDB.
pg_ducklake - ducklake = pg_duckdb - duckdb
And pg_ducklake brings DuckLake to Postgres, just like pg_duckdb brings DuckDB to Postgres.
How We Did It
I started by asking a code agent to propose several refactor plans. It compared the diff between our pg_ducklake fork and upstream pg_duckdb, then suggested ways to extract the DuckLake-specific changes into a standalone extension. I reviewed the plans and picked the one that was logically minimal while satisfying our constraints.
The two key decisions were:

- Dependency management. We import pg_duckdb and ducklake as git submodules. DuckDB is already a submodule of pg_duckdb, and we align the other DuckDB extensions (ducklake, duckdb_postgres) with that same DuckDB version. The DuckDB community promises that extensions work on a given DuckDB version (1.4, 1.5, etc.), so this keeps everything consistent.

- Compatibility with existing pg_duckdb users. pg_ducklake must not modify pg_duckdb; it only extends it. Existing pg_duckdb users can install pg_ducklake without changing their setup.
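The dependency-management decision above might look like this in .gitmodules. The paths and repo URLs are my assumptions about the layout, not taken from the actual tree:

```ini
# Hypothetical .gitmodules sketch. DuckDB itself arrives transitively
# as a submodule of pg_duckdb, and its pinned version is the one all
# other DuckDB extensions must align with.
[submodule "third_party/pg_duckdb"]
	path = third_party/pg_duckdb
	url = https://github.com/duckdb/pg_duckdb
[submodule "third_party/ducklake"]
	path = third_party/ducklake
	url = https://github.com/duckdb/ducklake
```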
The agent then applied the ducklake-specific changes onto an empty repo following the chosen plan. The whole refactor took about 1-2 weeks. The bottleneck was human — specifically, me. I took several days to understand and accept the new architecture, and to review the first few PRs.
Open Source in the AI Era
In this fresh repo, I set up agent infra (AGENTS.md, .agent/) — structured context for code agents to work with. With only ~10 .cpp files, code agents can narrow scope easily. Implementing a new feature takes ~30 minutes on average. Recent examples include #55 refactoring time-travel queries, #57 exporting frozen DuckLake snapshots, and #59 importing a frozen DuckLake as a read-only FDW.
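To make the time-travel example concrete: on the DuckDB side, DuckLake tables support reading a past snapshot with the AT clause, so the Postgres-side feature largely reduces to translating queries like the ones below. This is DuckDB's documented DuckLake syntax; whether pg_ducklake exposes it verbatim is my assumption:

```sql
-- DuckDB's DuckLake time-travel syntax: read a table as of a
-- specific snapshot version, or as of a wall-clock timestamp.
SELECT * FROM events AT (VERSION => 3);
SELECT * FROM events AT (TIMESTAMP => now() - INTERVAL '1 week');
```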
We can now implement features so fast that the bottleneck is knowing what to build. So I wrote a CONTRIBUTING.md that says we prefer feature requests or bug reports over code contributions. I guess this is what an open source project looks like in the AI era.