Iceberg Backend¶
The Iceberg backend provides lakehouse-grade versioning for Portolan catalogs using Apache Iceberg. It adds ACID transactions, rollback, and snapshot pruning on top of Portolan's standard versioning.
Installation¶
The Iceberg backend is an optional extra:
pip install portolan-cli[iceberg]
Or with pipx:
pipx install portolan-cli[iceberg]
Python 3.11+
The Iceberg backend requires Python 3.11 or later (due to PyIceberg).
Quick Start¶
# Initialize a catalog with the Iceberg backend
portolan init my-catalog --backend iceberg
# Add data (collection is inferred from the parent directory per ADR-0022;
# there is no --collection flag — place files under the collection dir first)
portolan add boundaries/data.parquet
# Check version history
portolan version list boundaries
# Rollback to a previous version
portolan version rollback boundaries 1.0.0
# Prune old versions
portolan version prune boundaries --keep 5
What It Provides¶
| Feature | File Backend (default) | Iceberg Backend |
|---|---|---|
| Version tracking | versions.json |
Iceberg snapshot properties |
| Concurrent writes | Single-writer only | ACID transactions |
| Rollback | Not supported | Instant (snapshot pointer reset) |
| Prune | Not supported | Expire old snapshots |
| Schema evolution | Manual detection | Automatic via Iceberg |
| Catalog backend | Local files only | SQLite, REST, Glue, DynamoDB, etc. |
Configuration¶
Selecting the Backend¶
# Set during init (persists in config)
portolan init my-catalog --backend iceberg
# Or set in existing catalog
portolan config set backend iceberg
Iceberg Catalog Configuration¶
The backend uses PyIceberg's configuration via environment variables:
PYICEBERG_CATALOG__PORTOLAKE__<PROPERTY>=<value>
Defaults (SQLite)¶
With no configuration, a local SQLite catalog is created:
| Setting | Default |
|---|---|
| Catalog type | sql (SQLite) |
| Catalog URI | sqlite:///<cwd>/.portolan/iceberg.db |
| Warehouse | file:///<cwd>/.portolan/warehouse |
REST Catalog¶
Connect to any Iceberg REST catalog (Tabular, Polaris, Unity Catalog, Nessie):
export PYICEBERG_CATALOG__PORTOLAKE__TYPE=rest
export PYICEBERG_CATALOG__PORTOLAKE__URI=https://my-catalog.example.com
export PYICEBERG_CATALOG__PORTOLAKE__WAREHOUSE=s3://my-bucket/warehouse
AWS Glue¶
export PYICEBERG_CATALOG__PORTOLAKE__TYPE=glue
export PYICEBERG_CATALOG__PORTOLAKE__WAREHOUSE=s3://my-bucket/warehouse
Configuration Precedence¶
- External PyIceberg config —
PYICEBERG_CATALOG__PORTOLAKE__*environment variables and/or~/.pyiceberg.yaml, resolved by PyIceberg itself (see PyIceberg configuration for the env-vs-YAML ordering) - Defaults (SQLite in
.portolan/)
Portolan delegates external-config resolution to PyIceberg; it does not separately layer env vars over the YAML file.
How It Works¶
Versioning: Semver on Iceberg Snapshots¶
Each Iceberg snapshot stores version metadata in its summary properties:
{
"portolake.version": "1.1.0",
"portolake.breaking": "false",
"portolake.message": "Updated population data",
"portolake.assets": '{"data.parquet": {"sha256": "...", ...}}',
"portolake.schema": '{"type": "geoparquet", ...}',
"portolake.changes": '["data.parquet"]'
}
No external versions.json — version info travels with the Iceberg table.
Collection-to-Table Mapping¶
Each collection maps to an Iceberg table under the portolake namespace: portolake.<collection_name>.
Spatial Optimization¶
For datasets with geometry columns, the backend automatically:
- Adds a geohash column to every table with geometry. Precision is
auto-selected by row count:
geohash_3(~150 km cells) for 100K–10M rows,geohash_4(~20 km cells) otherwise. - Adds bbox columns (
bbox_xmin,bbox_ymin,bbox_xmax,bbox_ymax) for manifest statistics. - Configures an Iceberg partition spec on the geohash column when the table has ≥100K rows. Smaller tables get the geohash column but no partition spec.
These derived columns are excluded from the STAC table:columns field and
from the static GeoParquet export.
Programmatic Usage¶
from portolan_cli.backends import get_backend
# Load the backend
backend = get_backend("iceberg")
# Or with a custom catalog
from pyiceberg.catalog import load_catalog
from portolan_cli.backends.iceberg import IcebergBackend
catalog = load_catalog("portolake", type="rest", uri="https://...")
backend = IcebergBackend(catalog=catalog)
See Iceberg API Reference for full method documentation.