CLI Reference
ftm-lakehouse provides a Typer-based command-line interface organised into sub-command groups.
Groups:
| Group | Purpose |
|---|---|
archive |
Content-addressed file storage |
entities |
Read and write FtM entities |
statements |
Read and write raw FtM statements |
mappings |
CSV-to-entity mapping configurations |
operations |
Dataset pipeline operations (export, compact, merge, vacuum, crawl) |
zfs |
ZFS dataset management |
Top-level (no group):
| Command | Purpose |
|---|---|
ls |
List dataset names in the catalog |
datasets |
Show metadata for all datasets in the catalog |
make |
Make/update a dataset (frequent shortcut, kept top-level) |
Global Options
| Option | Description |
|---|---|
--version |
Show version |
--settings |
Print current settings |
--uri |
Lakehouse URI (overrides LAKEHOUSE_URI) |
-d, --dataset |
Dataset name (required for most commands) |
Top-level
ls
datasets
make
# Build or refresh a dataset (flush journal + ensure index.json)
ftm-lakehouse -d my_dataset make
# Full pipeline: flush + all exports
ftm-lakehouse -d my_dataset make --full
# Re-run even when freshness tags say it's up-to-date
ftm-lakehouse -d my_dataset make --full --force
# Apply a config.yml override
ftm-lakehouse -d my_dataset make -c path/to/config.yml
entities
Read/write FtM entities (aggregated, statement-aware).
| Command | Purpose |
|---|---|
iterate |
Live read from the parquet store as FtM JSON lines |
stream |
Stream the pre-exported entities.ftm.json to stdout |
import |
Bulk import FtM JSON entities straight into the parquet store (bypasses the journal) |
# Live read of the parquet store
ftm-lakehouse -d my_dataset entities iterate
ftm-lakehouse -d my_dataset entities iterate -o entities.live.json
# Stream the frozen export (entities.ftm.json must exist)
ftm-lakehouse -d my_dataset entities stream
ftm-lakehouse -d my_dataset entities stream -o out.json
# Bulk import (the file is shard-sorted in memory then written directly to
# parquet – journal is bypassed for one-shot loads)
cat entities.ftm.json | ftm-lakehouse -d my_dataset entities import
ftm-lakehouse -d my_dataset entities import -i entities.ftm.json --origin bulk
ftm-lakehouse -d my_dataset entities import -i entities.ftm.json --bulk-size 250000
statements
Raw statement-grain read/write, mirroring entities at the lower level.
| Command | Purpose |
|---|---|
iterate |
Live read from the parquet store as CSV rows |
stream |
Stream the pre-exported statements.csv to stdout |
import |
Bulk import statements (CSV) straight into the parquet store |
ftm-lakehouse -d my_dataset statements iterate -o live-statements.csv
ftm-lakehouse -d my_dataset statements stream -o exported.csv
cat statements.csv | ftm-lakehouse -d my_dataset statements import
operations
Pipeline operations on a dataset.
| Command | Purpose |
|---|---|
export-statements |
Export the parquet store → exports/statements.csv |
export-entities |
Export → entities.ftm.json |
export-statistics |
Export → exports/statistics.json |
export-documents |
Export → exports/documents.csv |
compact |
Bin-pack small parquet files (cheap) |
merge |
Collapse duplicates, reap expired tombstones (expensive) |
vacuum |
Delete obsolete parquet files no longer referenced by the Delta log |
crawl |
Crawl documents from a local/remote source into the archive |
Exports
ftm-lakehouse -d my_dataset operations export-statements
ftm-lakehouse -d my_dataset operations export-entities
ftm-lakehouse -d my_dataset operations export-statistics
ftm-lakehouse -d my_dataset operations export-documents
Maintenance (async, on the parquet statement store)
# Cheap file bin-pack – does not change row contents.
ftm-lakehouse -d my_dataset operations compact
# Collapse duplicates per (shard, bucket, origin) partition; drop tombstones
# older than `LAKEHOUSE_GRACE_PERIOD_DAYS`.
ftm-lakehouse -d my_dataset operations merge
# Remove obsolete parquet files (the ones merge/compact have tombstoned).
ftm-lakehouse -d my_dataset operations vacuum
ftm-lakehouse -d my_dataset operations vacuum --retention-hours 24
All three acquire a dataset-wide write fence at .LOCK, so they don't race with each other or with append-style writes.
Crawl
ftm-lakehouse -d my_dataset operations crawl /path/to/documents
ftm-lakehouse -d my_dataset operations crawl https://example.com/files/
ftm-lakehouse -d my_dataset operations crawl /path --include "*.pdf"
ftm-lakehouse -d my_dataset operations crawl /path --exclude "*.tmp"
archive
# List archived files
ftm-lakehouse -d my_dataset archive ls
ftm-lakehouse -d my_dataset archive ls --keys # paths only
ftm-lakehouse -d my_dataset archive ls --checksums # checksums only
# Inspect / fetch
ftm-lakehouse -d my_dataset archive head <checksum>
ftm-lakehouse -d my_dataset archive get <checksum> -o document.pdf
# Bulk download to a local mirror
ftm-lakehouse -d my_dataset archive download -o /tmp/mirror
mappings
# Discover mappings
ftm-lakehouse -d my_dataset mappings ls
ftm-lakehouse -d my_dataset mappings get <content_hash>
# Process one mapping, or all mappings in the dataset
ftm-lakehouse -d my_dataset mappings process <content_hash>
ftm-lakehouse -d my_dataset mappings process
zfs
# Host-side socket agent (creates ZFS datasets on behalf of containerised clients)
ftm-lakehouse zfs agent --socket /run/zfs.sock --pool zpools/tank/lakehouse
# Manual init of a dataset's ZFS hierarchy
ftm-lakehouse zfs init my_dataset --pool zpools/tank/lakehouse
The zfs group does not require a catalog (it's about provisioning, not data).
Environment Variables
| Variable | Description | Default |
|---|---|---|
LAKEHOUSE_URI |
Base path to lakehouse storage | data |
LAKEHOUSE_JOURNAL_URI |
SQLAlchemy URI for the journal | sqlite:///:memory: |
LAKEHOUSE_ENTITY_SHARDS |
Uniform shard count per new dataset | 8 |
LAKEHOUSE_GRACE_PERIOD_DAYS |
Tombstone grace period used by operations merge |
30 |
LOG_LEVEL |
Logging level | INFO |
DEBUG |
Pretty-print tracebacks etc. | false |
See also Configuration for storage backend options (S3, GCS, Azure) and ZFS Integration.
Examples
End-to-end ingestion
export LAKEHOUSE_URI=./my_lakehouse
# Initialise the dataset
ftm-lakehouse -d my_dataset make
# Crawl some files
ftm-lakehouse -d my_dataset operations crawl /path/to/documents
# Bulk-load a pre-built entities.ftm.json (skips the journal)
ftm-lakehouse -d my_dataset entities import -i entities.ftm.json
# Build all exports
ftm-lakehouse -d my_dataset make --full
# Maintenance – async, run on a schedule in production
ftm-lakehouse -d my_dataset operations compact
ftm-lakehouse -d my_dataset operations merge
ftm-lakehouse -d my_dataset operations vacuum