REST API

ftm-lakehouse includes a FastAPI-based REST API for remote access to the lakehouse. It exposes journal operations and dataset job execution over HTTP with JWT-based authentication.

Running the API

uvicorn ftm_lakehouse.api:app --reload  # disable --reload for production

The interactive API docs (ReDoc) are served at /.

Configuration

API settings use the LAKEHOUSE_API_ prefix:

| Variable | Description | Default |
| --- | --- | --- |
| LAKEHOUSE_API_SECRET_KEY | JWT signing key | change-for-production |
| LAKEHOUSE_API_ACCESS_TOKEN_EXPIRE | Token expiry in minutes | 5 |
| LAKEHOUSE_API_ACCESS_TOKEN_ALGORITHM | JWT algorithm | HS256 |
| LAKEHOUSE_API_AUTH_REQUIRED | Require authentication | true |
| LAKEHOUSE_API_TITLE | OpenAPI title | FollowTheMoney Data Lakehouse Api |

When auth_required is false, read-only requests (GET, HEAD, OPTIONS) are allowed without a token. Write requests are always rejected in public mode.

Authentication

The API uses JWT bearer tokens with a method + path prefix authorization model. Tokens encode a list of allowed HTTP methods and path prefixes, keeping auth logic external to the API itself.

Token structure

Tokens carry two claims:

  • methods: List of allowed HTTP methods (e.g. ["GET", "POST"]) or ["*"] for all
  • prefixes: List of allowed path prefixes or glob patterns
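To make the claim layout concrete, here is a minimal sketch of what such a token looks like on the wire, built with only the standard library. Real tokens are issued by create_access_token; the exact payload produced by ftm-lakehouse (e.g. whether an exp claim is included) is not shown here, so treat this purely as an illustration of the HS256 JWT shape:

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    # JWTs use unpadded URL-safe base64
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def make_token(payload: dict, secret: str) -> str:
    # HS256 JWT: base64url(header).base64url(payload).base64url(signature)
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = b64url(json.dumps(payload).encode())
    signing_input = f"{header}.{body}".encode()
    sig = hmac.new(secret.encode(), signing_input, hashlib.sha256).digest()
    return f"{header}.{body}.{b64url(sig)}"


# The two claims described above: allowed methods and path prefixes
claims = {"methods": ["GET", "HEAD"], "prefixes": ["/my_dataset/"]}
token = make_token(claims, "change-for-production")
```

Decoding the middle segment of such a token yields the methods and prefixes claims back.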

Examples

Allow all access:

from ftm_lakehouse.api.auth import create_access_token

token = create_access_token(methods=["*"], prefixes=["/"])

Read-only access:

token = create_access_token(methods=["GET", "HEAD", "OPTIONS"], prefixes=["/"])

Scoped to a dataset's archive:

token = create_access_token(methods=["*"], prefixes=["/my_dataset/archive/"])

Glob pattern matching:

token = create_access_token(methods=["*"], prefixes=["/*/tags"])
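How the server evaluates these claims is internal to ftm-lakehouse; the following is only an illustrative sketch of the method + prefix/glob model, using fnmatch for the glob case. The function name and the exact precedence of prefix vs. glob matching are assumptions, not the library's actual code:

```python
from fnmatch import fnmatch


def is_allowed(
    method: str, path: str, methods: list[str], prefixes: list[str]
) -> bool:
    # Hypothetical check mirroring the claim model: the method must be
    # listed (or "*"), and the path must match at least one prefix or glob.
    if "*" not in methods and method.upper() not in methods:
        return False
    return any(
        path.startswith(prefix) or fnmatch(path, prefix) for prefix in prefixes
    )
```

Under this sketch, a token with prefixes=["/*/tags"] would authorize /my_dataset/tags but not /my_dataset/archive/blob.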

Routes

Storage

The base storage routes are provided by anystore and expose raw key-value access to the underlying lakehouse store. All keys are path-based -- for example, my_dataset/archive/ab/cd/ef/{checksum}/blob addresses a file blob.
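Judging from the example key above, archive blobs appear to be sharded by the first three hex-pairs of the content checksum. A sketch of building such a key (the layout is inferred from the example, not a documented API, and the helper name is made up):

```python
def archive_blob_key(dataset: str, checksum: str) -> str:
    # Shard by the first three hex-pairs of the checksum, mirroring
    # the my_dataset/archive/ab/cd/ef/{checksum}/blob example.
    a, b, c = checksum[0:2], checksum[2:4], checksum[4:6]
    return f"{dataset}/archive/{a}/{b}/{c}/{checksum}/blob"
```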

| Method | Path | Description |
| --- | --- | --- |
| GET | /{key:path} | Retrieve a stored value by key |
| GET | /{prefix:path}?keys=true | List all keys under a prefix |
| GET | /{prefix:path}?keys=true&glob=*.json | List keys matching a glob pattern |
| HEAD | /{key:path} | Get metadata (size, content type, timestamps) |
| HEAD | /{key:path}?checksum=true | Get metadata with checksum in the x-anystore-checksum header |
| PUT | /{key:path} | Store a value (request body streamed directly to storage) |
| DELETE | /{key:path} | Delete a value |
| PATCH | /{key:path} | Touch a key (update its timestamp) |

GET supports HTTP range requests via the Range header (e.g. Range: bytes=0-1023) and returns 206 Partial Content with the requested byte range.

Response headers include Content-Length, Content-Type, Accept-Ranges, Last-Modified, and x-anystore-* metadata fields.
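To illustrate the range semantics, here is a small sketch of how a Range: bytes=start-end header maps onto byte offsets for a 206 response. The actual parsing is handled server-side by anystore; this is not its code, just the standard HTTP byte-range arithmetic:

```python
import re


def parse_range(header: str, size: int) -> tuple[int, int]:
    # "bytes=0-1023" -> (0, 1023); an omitted end means "to the last byte"
    m = re.fullmatch(r"bytes=(\d+)-(\d*)", header)
    if m is None:
        raise ValueError(f"unsupported Range header: {header!r}")
    start = int(m.group(1))
    end = int(m.group(2)) if m.group(2) else size - 1
    return start, min(end, size - 1)


start, end = parse_range("bytes=0-1023", size=4096)
# A 206 response for this request would carry: Content-Range: bytes 0-1023/4096
content_range = f"bytes {start}-{end}/4096"
```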

Journal

| Method | Path | Description |
| --- | --- | --- |
| POST | /{dataset}/journal/bulk | Write TSV rows into the journal |
| GET | /{dataset}/journal/iterate | Stream all journal rows as TSV |
| POST | /{dataset}/journal/flush | Stream and delete journal rows |
| GET | /{dataset}/journal/count | Get the journal row count |
| DELETE | /{dataset}/journal/clear | Delete all journal rows |
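The bulk endpoint consumes TSV. A sketch of serializing rows into a request body with the standard csv module; the column values below are placeholders, since the actual journal row schema is defined by ftm-lakehouse and not shown here:

```python
import csv
import io

# Hypothetical rows -- the real journal row schema is not reproduced here
rows = [
    ["entity-1", "Person", "name", "Alice"],
    ["entity-2", "Company", "name", "ACME Ltd"],
]

buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
writer.writerows(rows)
body = buf.getvalue()  # POST this as the body to /{dataset}/journal/bulk
```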

Operations

| Method | Path | Description |
| --- | --- | --- |
| POST | /{dataset}/_operation | Run a job operation on a dataset |

The request body must be a serialized DatasetJobModel with a name field identifying the operation:

{
    "name": "CrawlJob",
    "source": "s3://bucket/path"
}

Available operations:

| Job name | Description |
| --- | --- |
| CrawlJob | Batch file ingestion from a source URI |
| OptimizeJob | Compact parquet files, with optional vacuum and translog apply |
| ExportStatementsJob | Export to statements.csv |
| ExportEntitiesJob | Export to entities.ftm.json |
| ExportStatisticsJob | Export to statistics.json |
| ExportDocumentsJob | Export to documents.csv |
| ExportIndexJob | Export index.json with resources |
| MappingJob | Process a CSV mapping configuration |
| RecreateJob | Rebuild the parquet store from exports |
| DownloadArchiveJob | Export archive files to their original paths |
| MakeJob | Full workflow: flush + all exports |

Pass ?force=true to skip freshness checks.
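Putting the pieces together, a sketch of building (not sending) an operation request with the standard library. The base URL and the bearer token are placeholders:

```python
import json
import urllib.request

# The CrawlJob payload from the example above; ?force=true skips freshness checks
payload = {"name": "CrawlJob", "source": "s3://bucket/path"}
req = urllib.request.Request(
    "http://localhost:8000/my_dataset/_operation?force=true",
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer <token>",
    },
    method="POST",
)
# urllib.request.urlopen(req) would execute the job on the server
```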