Partitioning Large GeoParquet Files¶

Portolan automatically partitions large GeoParquet files into smaller, spatially-organized chunks. This improves query performance and enables efficient cloud access patterns.

When to Partition¶

Partition GeoParquet files when:

File size exceeds 2GB (OGC best practices threshold)
Row count exceeds millions of features
Spatial queries are common (partitioning enables spatial pruning)

Portolan uses the 2GB threshold by default, matching OGC GeoParquet best practices.

How It Works¶

When you add a large GeoParquet file, Portolan:

Detects the file exceeds the threshold
Prompts for confirmation (in interactive mode)
Partitions using KD-tree spatial indexing
Stores as Hive-style directories: kdtree_cell=0/, kdtree_cell=1/, etc.
Emits partition:* STAC extension metadata

$ portolan add large-dataset.parquet

Found 1 file(s) exceeding 2.0 GB threshold:
  large-dataset.parquet (4.23 GB)

Partition large files into spatial chunks? [Y/n] y

Partitioning: large-dataset.parquet
  Strategy: kdtree (data-driven spatial)
  Target: ~120,000 rows per partition
  ████████████████████████████████████████ 100%

✓ Created 98 partitions in large-dataset/
✓ Added to collection: default

Configuration¶

Configure partitioning in .portolan/config.yaml:

partitioning:
  enabled: true           # Enable auto-partitioning (default: true)
  prompt: true            # Ask before partitioning in interactive mode (default: true)
  threshold_gb: 2.0       # Size threshold in GB (default: 2.0)
  strategy: kdtree        # Partitioning strategy (default: kdtree)
  target_rows: 120000     # Target rows per partition (default: 120,000)

Configuration Options¶

Setting	Type	Default	Description
`partitioning.enabled`	bool	`true`	Enable automatic partitioning during `portolan add`
`partitioning.prompt`	bool	`true`	Prompt user before partitioning (interactive mode)
`partitioning.threshold_gb`	float	`2.0`	File size threshold in GB
`partitioning.strategy`	string	`kdtree`	Partitioning strategy
`partitioning.target_rows`	int	`120000`	Target rows per partition

Strategies¶

Strategy	Description
`kdtree`	KD-tree spatial partitioning (default, data-driven)
`h3`	H3 hexagonal grid partitioning (planned)
`s2`	S2 cell partitioning (planned)
`quadkey`	Quadkey partitioning (planned)

Currently only kdtree is implemented. Other strategies are planned for future releases.

STAC Metadata¶

Partitioned collections include partition:* extension fields:

{
  "stac_extensions": [
    "https://portolan-sdi.github.io/stac-partition-extension/v1.0.0/schema.json"
  ],
  "partition:scheme": "hive",
  "partition:strategy": "kdtree",
  "partition:keys": [
    {"name": "kdtree_cell", "type": "string"}
  ],
  "partition:file_count": 98,
  "assets": {
    "data": {
      "href": "./kdtree_cell=*/data.parquet",
      "partition:glob": "s3://bucket/collection/kdtree_cell=*/data.parquet"
    }
  }
}

Consuming Partitioned Data¶

DuckDB¶

DuckDB can query partitioned data directly using the glob pattern:

SELECT *
FROM read_parquet('s3://bucket/collection/kdtree_cell=*/data.parquet')
WHERE ST_Intersects(geometry, ST_GeomFromText('POLYGON(...)'))

DuckDB automatically prunes partitions based on the Hive directory structure.

PyArrow¶

import pyarrow.parquet as pq
import pyarrow.dataset as ds

# Read as dataset with partition pruning
dataset = ds.dataset(
    "s3://bucket/collection/",
    partitioning="hive"
)

# Query with partition filtering
table = dataset.to_table(
    filter=ds.field("kdtree_cell").isin(["0", "1", "2"])
)

GDAL/OGR¶

ogrinfo -al "/vsicurl/s3://bucket/collection/kdtree_cell=0/data.parquet"

Validation¶

Use portolan check --thorough to validate partition consistency:

$ portolan check --thorough

Checking partition structure...
✓ All partitions use consistent key: kdtree_cell
✓ All partition files have consistent schema
✓ No orphan files outside partition structure

This checks: - All partition directories use the same key pattern - All parquet files have identical schemas - No files exist outside the partition structure

Manual Partitioning¶

For more control, use the standalone portolan partition command:

portolan partition large.parquet \
  --output-dir ./partitioned/ \
  --strategy kdtree \
  --target-rows 100000

See portolan partition --help for all options.