Partitioning Large GeoParquet Files¶
Portolan automatically partitions large GeoParquet files into smaller, spatially organized chunks. This improves query performance and enables efficient cloud access patterns.
When to Partition¶
Partition GeoParquet files when:
- File size exceeds 2 GB (OGC best practices threshold)
- Row count runs into the millions of features
- Spatial queries are common (partitioning enables spatial pruning)
Portolan uses the 2 GB threshold by default, matching OGC GeoParquet best practices.
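The size rule amounts to a single comparison. The sketch below is illustrative only: the helper names are hypothetical, and it assumes decimal gigabytes (10^9 bytes), which may differ from how Portolan measures file size.

```python
import os

GB = 1000 ** 3  # assumption: decimal gigabytes; Portolan may use 1024**3 instead


def exceeds_threshold(size_bytes: int, threshold_gb: float = 2.0) -> bool:
    """Return True when a size in bytes is strictly above the threshold."""
    return size_bytes > threshold_gb * GB


def file_exceeds_threshold(path: str, threshold_gb: float = 2.0) -> bool:
    """Apply the same check to a file on disk (hypothetical helper)."""
    return exceeds_threshold(os.path.getsize(path), threshold_gb)
```

Under these assumptions, a 4.23 GB file exceeds the default threshold, while a file of exactly 2 GB does not.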
How It Works¶
When you add a large GeoParquet file, Portolan:
- Detects the file exceeds the threshold
- Prompts for confirmation (in interactive mode)
- Partitions using KD-tree spatial indexing
- Stores as Hive-style directories: kdtree_cell=0/, kdtree_cell=1/, etc.
- Emits partition:* STAC extension metadata
$ portolan add large-dataset.parquet
Found 1 file(s) exceeding 2.0 GB threshold:
large-dataset.parquet (4.23 GB)
Partition large files into spatial chunks? [Y/n] y
Partitioning: large-dataset.parquet
Strategy: kdtree (data-driven spatial)
Target: ~120,000 rows per partition
████████████████████████████████████████ 100%
✓ Created 98 partitions in large-dataset/
✓ Added to collection: default
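To make the KD-tree step more concrete, here is a minimal sketch of data-driven spatial splitting. It is not Portolan's implementation: it assumes each feature is represented by a single (x, y) point, splits at the median along alternating axes, and stops once a cell holds at most `target_rows` points. The function name and the synthetic data are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def kdtree_partition(points: List[Point], target_rows: int, depth: int = 0) -> List[List[Point]]:
    """Recursively split points at the median until each cell has at most target_rows."""
    if target_rows < 1:
        raise ValueError("target_rows must be at least 1")
    if len(points) <= target_rows:
        return [points]
    axis = depth % 2  # alternate between x (0) and y (1)
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return (
        kdtree_partition(ordered[:mid], target_rows, depth + 1)
        + kdtree_partition(ordered[mid:], target_rows, depth + 1)
    )


# Synthetic example: 1,000 grid points, at most 150 points per cell
points = [(float(x), float(y)) for x in range(40) for y in range(25)]
cells = kdtree_partition(points, target_rows=150)
for cell_id, cell in enumerate(cells):
    print(f"kdtree_cell={cell_id}/ -> {len(cell)} rows")
```

Because the splits follow the data rather than a fixed grid, dense areas end up with smaller cells and sparse areas with larger ones, while row counts per cell stay roughly equal. This simplified version always produces a power-of-two number of cells; the 98 partitions in the session above suggest Portolan's actual splitting rule differs.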
Configuration¶
Configure partitioning in .portolan/config.yaml:
partitioning:
  enabled: true        # Enable auto-partitioning (default: true)
  prompt: true         # Ask before partitioning in interactive mode (default: true)
  threshold_gb: 2.0    # Size threshold in GB (default: 2.0)
  strategy: kdtree     # Partitioning strategy (default: kdtree)
  target_rows: 120000  # Target rows per partition (default: 120,000)
Configuration Options¶
| Setting | Type | Default | Description |
|---|---|---|---|
| partitioning.enabled | bool | true | Enable automatic partitioning during portolan add |
| partitioning.prompt | bool | true | Prompt user before partitioning (interactive mode) |
| partitioning.threshold_gb | float | 2.0 | File size threshold in GB |
| partitioning.strategy | string | kdtree | Partitioning strategy |
| partitioning.target_rows | int | 120000 | Target rows per partition |
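The settings and defaults above can be modelled as a small typed structure. This is a hypothetical sketch, not Portolan's own loader: `PartitioningConfig` and `load_partitioning` are invented names, and the input is assumed to be the config file already parsed into a dictionary.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict


@dataclass(frozen=True)
class PartitioningConfig:
    """Defaults mirror the documented settings; the class itself is hypothetical."""
    enabled: bool = True
    prompt: bool = True
    threshold_gb: float = 2.0
    strategy: str = "kdtree"
    target_rows: int = 120_000


def load_partitioning(raw: Dict[str, Any]) -> PartitioningConfig:
    """Merge the `partitioning` mapping of a parsed config with the defaults."""
    section = raw.get("partitioning") or {}
    known = {f.name for f in fields(PartitioningConfig)}
    unknown = set(section) - known
    if unknown:
        raise ValueError(f"unknown partitioning settings: {sorted(unknown)}")
    config = PartitioningConfig(**section)
    if config.strategy != "kdtree":
        # The documentation states that only kdtree is currently implemented.
        raise ValueError("only the kdtree strategy is currently implemented")
    if config.threshold_gb <= 0 or config.target_rows <= 0:
        raise ValueError("threshold_gb and target_rows must be positive")
    return config
```

For example, `load_partitioning({"partitioning": {"target_rows": 100000}})` keeps every default except `target_rows`.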
Strategies¶
| Strategy | Description |
|---|---|
| kdtree | KD-tree spatial partitioning (default, data-driven) |
| h3 | H3 hexagonal grid partitioning (planned) |
| s2 | S2 cell partitioning (planned) |
| quadkey | Quadkey partitioning (planned) |
Currently only kdtree is implemented. Other strategies are planned for future releases.
STAC Metadata¶
Partitioned collections include partition:* extension fields:
{
  "stac_extensions": [
    "https://portolan-sdi.github.io/stac-partition-extension/v1.0.0/schema.json"
  ],
  "partition:scheme": "hive",
  "partition:strategy": "kdtree",
  "partition:keys": [
    {"name": "kdtree_cell", "type": "string"}
  ],
  "partition:file_count": 98,
  "assets": {
    "data": {
      "href": "./kdtree_cell=*/data.parquet",
      "partition:glob": "s3://bucket/collection/kdtree_cell=*/data.parquet"
    }
  }
}
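A client that reads this metadata typically needs to map between the glob pattern and concrete partition paths. The sketch below shows one way to do that with plain string handling; both helper names are hypothetical and are not part of Portolan or the STAC extension.

```python
from typing import Dict


def parse_hive_path(path: str) -> Dict[str, str]:
    """Extract key=value partition segments from a Hive-style path."""
    keys: Dict[str, str] = {}
    for part in path.split("/"):
        if "=" in part:
            name, _, value = part.partition("=")
            keys[name] = value
    return keys


def partition_href(glob_pattern: str, cell: str) -> str:
    """Fill the first wildcard of a partition glob with a concrete cell value."""
    return glob_pattern.replace("*", cell, 1)


glob_pattern = "s3://bucket/collection/kdtree_cell=*/data.parquet"
href = partition_href(glob_pattern, "7")
print(href)
print(parse_hive_path(href))
```

Values parsed this way are strings, which matches the `"type": "string"` declared under `partition:keys`.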
Consuming Partitioned Data¶
DuckDB¶
DuckDB can query partitioned data directly using the glob pattern:
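-- ST_Intersects and ST_GeomFromText come from the spatial extension
INSTALL spatial; LOAD spatial;
SELECT *
FROM read_parquet('s3://bucket/collection/kdtree_cell=*/data.parquet')
WHERE ST_Intersects(geometry, ST_GeomFromText('POLYGON(...)'));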
DuckDB automatically prunes partitions based on the Hive directory structure when the query filters on the partition key (kdtree_cell).
PyArrow¶
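import pyarrow as pa
import pyarrow.dataset as ds

# Declare kdtree_cell as a string (matching partition:keys);
# otherwise PyArrow infers an integer type from values like 0
partitioning = ds.partitioning(
    pa.schema([("kdtree_cell", pa.string())]),
    flavor="hive",
)
# Read as dataset with partition pruning
dataset = ds.dataset(
    "s3://bucket/collection/",
    format="parquet",
    partitioning=partitioning,
)
# Query with partition filtering
table = dataset.to_table(
    filter=ds.field("kdtree_cell").isin(["0", "1", "2"])
)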
GDAL/OGR¶
ogrinfo -al "/vsis3/bucket/collection/kdtree_cell=0/data.parquet"
Validation¶
Use portolan check --thorough to validate partition consistency:
$ portolan check --thorough
Checking partition structure...
✓ All partitions use consistent key: kdtree_cell
✓ All partition files have consistent schema
✓ No orphan files outside partition structure
This checks:
- All partition directories use the same key pattern
- All parquet files have identical schemas
- No files exist outside the partition structure
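Two of the structural checks listed above (consistent keys and no orphan files) can be approximated with the standard library alone. This is a simplified sketch, not the logic behind `portolan check`: the function name is hypothetical, it assumes the directory holds nothing but partition folders, and it skips the schema comparison, which would need a Parquet reader.

```python
import tempfile
from pathlib import Path
from typing import List


def check_partition_layout(root: Path) -> List[str]:
    """Return a list of problems found in a Hive-style partition directory."""
    problems: List[str] = []
    keys = set()
    for entry in sorted(root.iterdir()):
        if entry.is_dir() and "=" in entry.name:
            keys.add(entry.name.split("=", 1)[0])
        else:
            # Assumption: anything that is not a key=value directory is an orphan
            problems.append(f"orphan entry outside partition structure: {entry.name}")
    if len(keys) > 1:
        problems.append(f"inconsistent partition keys: {sorted(keys)}")
    return problems


# Build a throwaway layout with two valid partitions and one stray file
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for cell in ("0", "1"):
        (root / f"kdtree_cell={cell}").mkdir()
        (root / f"kdtree_cell={cell}" / "data.parquet").touch()
    (root / "stray.parquet").touch()
    for problem in check_partition_layout(root):
        print(problem)
```

A real collection directory may also contain metadata files, so a production check would need an allow-list for those.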
Manual Partitioning¶
For more control, use the standalone portolan partition command:
portolan partition large.parquet \
--output-dir ./partitioned/ \
--strategy kdtree \
--target-rows 100000
See portolan partition --help for all options.