REST API

ftm-lakehouse ships a FastAPI app that exposes the storage layer, the journal, the entity / statement read+write paths, and dataset job execution over HTTP. It carries no authentication, authorization, or rate-limiting logic – those are deployment concerns and belong in front of the app, not inside it.

Use reverse proxy in production

File serving

Although the api exposes HEAD / GET endpoints, for production use it is recommended to use a static file server like nginx. One approach for that is currently researched and developed in PutFS.

Authentication

The API is intentionally unprotected at the application layer. Run it behind a reverse proxy (Caddy / nginx / Traefik / a sidecar service) that handles authentication, authorization, and rate-limiting before forwarding to the lakehouse.

PutFS auth model is a good reference for how an operator can wire path-prefix + HTTP-method scoped tokens at the proxy layer.

Request timeouts

The API does not enforce a per-request wall-clock timeout. Configure proxy_read_timeout (nginx), timeouts (Caddy), or the equivalent in your proxy to bound how long a request can occupy a connection.

Request body size

The API does not cap request body size. Configure client_max_body_size (nginx), request_body (Caddy), or the equivalent in your proxy. Endpoints that semantically constrain content shape (e.g. entities/query capping entity_ids length) still validate after the body is parsed.

Running the API

uvicorn ftm_lakehouse.api:app --reload  # disable --reload for production

The interactive API docs (ReDoc) are served at /.

Routes

All lakehouse-specific routes are scoped to a dataset and namespaced under /{dataset}/_api/.... The raw key-value storage layer (anystore's catch-all GET /{key:path} etc.) is mounted last for blob access but is out of scope for this API surface – see the anystore docs for that contract.

Journal

Method	Path	Description
`POST`	`/{dataset}/_api/journal/bulk`	Write JSONL rows into the journal
`GET`	`/{dataset}/_api/journal/iterate`	Stream all journal rows as JSONL
`POST`	`/{dataset}/_api/journal/flush`	Stream and delete journal rows as JSONL
`GET`	`/{dataset}/_api/journal/count`	Get journal row count
`DELETE`	`/{dataset}/_api/journal/clear`	Delete all journal rows

Entities

Method	Path	Description
`POST`	`/{dataset}/_api/entities/flush`	Drain the journal into parquet
`POST`	`/{dataset}/_api/entities/merge`	Collapse duplicates + reap expired tombstones
`POST`	`/{dataset}/_api/entities/query`	Query entities, streamed as NDJSON
`POST`	`/{dataset}/_api/entities/statements/query`	Query raw statements, streamed as NDJSON
`GET`	`/{dataset}/_api/entities/stats`	Dataset statistics
`GET`	`/{dataset}/_api/entities/statements/version`	Current Delta table version
`DELETE`	`/{dataset}/_api/entities/{entity_id}`	Tombstone all statements for an entity

Operations

Method	Path	Description
`POST`	`/{dataset}/_api/operations`	Run a job operation on a dataset

The request body must be a serialized DatasetJobModel with a name field identifying the operation:

{
    "name": "CrawlJob",
    "source": "s3://bucket/path"
}

Available operations:

Job name	Description
`CrawlJob`	Batch file ingestion from a source URI
`OptimizeJob`	Merge duplicates / reap tombstones, bin-pack small files, delete obsolete files
`ExportJob`	Export by `kind`: `statements`, `entities`, `documents`, `statistics`, `index`
`MappingJob`	Process a CSV mapping configuration
`DownloadArchiveJob`	Export archive files to original paths
`MakeJob`	Full workflow: flush + all exports

Pass ?force=true to skip freshness checks.

Configuration

API-only settings use the LAKEHOUSE_API_ prefix:

Variable	Description	Default
`LAKEHOUSE_API_TITLE`	OpenAPI title	`FollowTheMoney Data Lakehouse Api`
`LAKEHOUSE_API_STATIC_HEADERS`	Extra headers added to every response	`{}`
`LAKEHOUSE_API_MAX_ENTITY_IDS`	Maximum length of an `entity_ids` list in a query body (caps the SQL `IN (…)` clause that DuckDB has to build).	`10_000`
`LAKEHOUSE_API_MAX_FILTER_KEYS`	Maximum number of top-level keys (ftmq filter kwargs plus `entity_ids` / `flush_first`) accepted in a single query body.	`20`

Storage URI, journal URI, shard count, etc. use the regular LAKEHOUSE_ settings – see Configuration.