REST API
ftm-lakehouse ships a FastAPI app that exposes the storage layer, the journal, the entity / statement read+write paths, and dataset job execution over HTTP. It carries no authentication, authorization, or rate-limiting logic – those are deployment concerns and belong in front of the app, not inside it.
Use a reverse proxy in production
File serving
Although the API exposes HEAD / GET endpoints, a static file server such as nginx is recommended for serving files in production. One approach to this is currently being researched and developed in PutFS.
Authentication
The API is intentionally unprotected at the application layer. Run it behind a reverse proxy (Caddy / nginx / Traefik / a sidecar service) that handles authentication, authorization, and rate-limiting before forwarding to the lakehouse.
The PutFS auth model is a good reference for how an operator can wire tokens scoped by path prefix and HTTP method at the proxy layer.
Request timeouts
The API does not enforce a per-request wall-clock timeout. Configure proxy_read_timeout (nginx), timeouts (Caddy), or the equivalent in your proxy to bound how long a request can occupy a connection.
Request body size
The API does not cap request body size. Configure client_max_body_size (nginx), request_body (Caddy), or the equivalent in your proxy. Endpoints that semantically constrain content shape (e.g. entities/query capping entity_ids length) still validate after the body is parsed.
Running the API
The interactive API docs (ReDoc) are served at /.
Routes
All lakehouse-specific routes are scoped to a dataset and namespaced under /{dataset}/_api/.... The raw key-value storage layer (anystore's catch-all GET /{key:path} etc.) is mounted last for blob access but is out of scope for this API surface – see the anystore docs for that contract.
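As a minimal sketch of the dataset scoping described above, the helper below builds a route URL of the form `/{dataset}/_api/...`. The helper name `api_url`, the base URL, and the dataset name `my_dataset` are assumptions for illustration only; they are not part of ftm-lakehouse.

```python
from urllib.parse import quote


def api_url(base_url: str, dataset: str, route: str) -> str:
    """Build a dataset-scoped URL: {base_url}/{dataset}/_api/{route}."""
    # Percent-encode the dataset name so it stays a single path segment.
    dataset_segment = quote(dataset, safe="")
    return f"{base_url.rstrip('/')}/{dataset_segment}/_api/{route.lstrip('/')}"


# Hypothetical host and dataset name.
print(api_url("http://localhost:8000", "my_dataset", "journal/count"))
```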
Journal
| Method | Path | Description |
|---|---|---|
| `POST` | `/{dataset}/_api/journal/bulk` | Write JSONL rows into the journal |
| `GET` | `/{dataset}/_api/journal/iterate` | Stream all journal rows as JSONL |
| `POST` | `/{dataset}/_api/journal/flush` | Stream and delete journal rows as JSONL |
| `GET` | `/{dataset}/_api/journal/count` | Get journal row count |
| `DELETE` | `/{dataset}/_api/journal/clear` | Delete all journal rows |
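The following sketch prepares a bulk write to the journal using only the Python standard library. It is an illustration under stated assumptions: the row keys are placeholders rather than the real journal schema, the `Content-Type` value is a guess, and the host and dataset name are hypothetical. Check the interactive API docs for the actual contract.

```python
import json
from urllib.request import Request


def journal_bulk_request(base_url: str, dataset: str, rows: list[dict]) -> Request:
    """Prepare (but do not send) a POST to the journal bulk-write route."""
    # JSONL: one JSON document per line, each terminated by a newline.
    body = "".join(json.dumps(row) + "\n" for row in rows).encode("utf-8")
    return Request(
        f"{base_url.rstrip('/')}/{dataset}/_api/journal/bulk",
        data=body,
        method="POST",
        # Assumed content type; the API may expect a different one.
        headers={"Content-Type": "application/x-ndjson"},
    )


# Placeholder rows: these keys are illustrative, not the journal schema.
rows = [{"example": 1}, {"example": 2}]
req = journal_bulk_request("http://localhost:8000", "my_dataset", rows)
print(req.get_method(), req.full_url)
```

The prepared request can be sent with `urllib.request.urlopen(req)` once a server is reachable.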
Entities
| Method | Path | Description |
|---|---|---|
| `POST` | `/{dataset}/_api/entities/flush` | Drain the journal into parquet |
| `POST` | `/{dataset}/_api/entities/merge` | Collapse duplicates + reap expired tombstones |
| `POST` | `/{dataset}/_api/entities/query` | Query entities, streamed as NDJSON |
| `POST` | `/{dataset}/_api/entities/statements/query` | Query raw statements, streamed as NDJSON |
| `GET` | `/{dataset}/_api/entities/stats` | Dataset statistics |
| `GET` | `/{dataset}/_api/entities/statements/version` | Current Delta table version |
| `DELETE` | `/{dataset}/_api/entities/{entity_id}` | Tombstone all statements for an entity |
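A client for the query routes needs two pieces: a JSON request body and a line-by-line reader for the NDJSON response. The sketch below shows both. The `entity_ids` and `flush_first` keys are the ones named in the configuration section of this page; treating `flush_first` as a boolean, the function names, the host, and the sample response bytes are assumptions.

```python
import json
from typing import Iterable, Iterator
from urllib.request import Request


def entities_query_request(
    base_url: str, dataset: str, entity_ids: Iterable[str], flush_first: bool = False
) -> Request:
    """Prepare (but do not send) a POST to the entities query route."""
    # flush_first is assumed to be a boolean flag.
    payload = {"entity_ids": list(entity_ids), "flush_first": flush_first}
    return Request(
        f"{base_url.rstrip('/')}/{dataset}/_api/entities/query",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def iter_ndjson(lines: Iterable[bytes]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


req = entities_query_request("http://localhost:8000", "my_dataset", ["id-1", "id-2"])

# Made-up response lines standing in for a streamed NDJSON body.
sample = [b'{"id": "id-1"}\n', b"\n", b'{"id": "id-2"}\n']
entities = list(iter_ndjson(sample))
print(len(entities))
```

Against a live server, the response object returned by `urllib.request.urlopen(req)` can be passed directly to `iter_ndjson`, since it iterates over lines.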
Operations
| Method | Path | Description |
|---|---|---|
| `POST` | `/{dataset}/_api/operations` | Run a job operation on a dataset |
The request body must be a serialized DatasetJobModel whose name field identifies the operation.
Available operations:
| Job name | Description |
|---|---|
| `CrawlJob` | Batch file ingestion from a source URI |
| `CompactJob` | Bin-pack small parquet files |
| `MergeJob` | Per-partition dedup + tombstone reap |
| `VacuumJob` | Delete obsolete parquet files |
| `ExportStatementsJob` | Export to statements.csv |
| `ExportEntitiesJob` | Export to entities.ftm.json |
| `ExportStatisticsJob` | Export to statistics.json |
| `ExportDocumentsJob` | Export to documents.csv |
| `ExportIndexJob` | Export index.json with resources |
| `MappingJob` | Process a CSV mapping configuration |
| `DownloadArchiveJob` | Export archive files to original paths |
| `MakeJob` | Full workflow: flush + all exports |
Pass ?force=true to skip freshness checks.
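As a sketch of how a client might trigger an operation, the function below prepares a POST to the operations route with the job name in the body and the optional `force` query parameter. Only the `name` field is documented here; a real DatasetJobModel may require further fields, which the hypothetical `job_fields` argument passes through unchanged. The host and dataset name are assumptions as well.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def operation_request(
    base_url: str, dataset: str, job_name: str, force: bool = False, **job_fields
) -> Request:
    """Prepare (but do not send) a POST that runs a job on a dataset."""
    # "name" selects the operation; extra fields are job-specific assumptions.
    payload = {"name": job_name, **job_fields}
    url = f"{base_url.rstrip('/')}/{dataset}/_api/operations"
    if force:
        # ?force=true skips freshness checks.
        url += "?" + urlencode({"force": "true"})
    return Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )


req = operation_request("http://localhost:8000", "my_dataset", "CompactJob", force=True)
print(req.full_url)
```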
Configuration
API-only settings use the LAKEHOUSE_API_ prefix:
| Variable | Description | Default |
|---|---|---|
| `LAKEHOUSE_API_TITLE` | OpenAPI title | FollowTheMoney Data Lakehouse Api |
| `LAKEHOUSE_API_ALLOWED_ORIGINS` | CORS allow-list | ["http://localhost:3000"] |
| `LAKEHOUSE_API_STATIC_HEADERS` | Extra headers added to every response | {} |
| `LAKEHOUSE_API_MAX_ENTITY_IDS` | Maximum length of an entity_ids list in a query body (caps the SQL IN (…) clause that DuckDB has to build). | `10_000` |
| `LAKEHOUSE_API_MAX_FILTER_KEYS` | Maximum number of top-level keys (ftmq filter kwargs plus entity_ids / flush_first) accepted in a single query body. | `20` |
Storage URI, journal URI, shard count, etc. use the regular LAKEHOUSE_ settings – see Configuration.
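Because the server caps the length of `entity_ids` per query body, a client with a longer ID list has to split it across several requests. The sketch below is one way to do that on the client side. The helper name `chunk_entity_ids` is hypothetical, and the batch size assumes the documented default of 10_000; adjust it if the deployment overrides `LAKEHOUSE_API_MAX_ENTITY_IDS`.

```python
from typing import Iterator, Sequence

# Documented default of LAKEHOUSE_API_MAX_ENTITY_IDS; a deployment may override it.
MAX_ENTITY_IDS = 10_000


def chunk_entity_ids(
    entity_ids: Sequence[str], size: int = MAX_ENTITY_IDS
) -> Iterator[Sequence[str]]:
    """Split entity_ids into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(entity_ids), size):
        yield entity_ids[start:start + size]


ids = [f"id-{i}" for i in range(25_000)]
batch_sizes = [len(batch) for batch in chunk_entity_ids(ids)]
print(batch_sizes)
```

Each batch can then be sent as the `entity_ids` value of its own query request.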