Configuration

ftm-lakehouse can be configured via environment variables or YAML configuration files.

Environment Variables

Core Settings

| Variable | Description | Default |
| --- | --- | --- |
| LAKEHOUSE_URI | Base path to lakehouse storage | ./data |
| LAKEHOUSE_JOURNAL_URI | SQLAlchemy URI for the statement journal | sqlite:///:memory: |
| LAKEHOUSE_ON_ZFS | Enable ZFS dataset creation for local storage | false |
| LAKEHOUSE_ZFS_POOL | ZFS dataset path for the lakehouse root (e.g. zpools/tank/lakehouse) | (required when LAKEHOUSE_ON_ZFS is enabled) |
| LAKEHOUSE_ZFS_SOCKET | Unix socket path for remote ZFS operations (see ZFS Integration) | (unset) |
| LAKEHOUSE_ZFS_OWNER | uid:gid to chown new ZFS mountpoints to (see ZFS Integration) | (unset; no chown) |
| LAKEHOUSE_PUBLIC_URL_PREFIX | Public URL prefix for blob URLs (supports a {{ dataset }} Jinja-style template) | (unset) |
| LAKEHOUSE_ARCHIVE_URL_EXPIRE | Expiration for signed/tokenized archive URLs, in seconds | 900 (15 min) |
| LOG_LEVEL | Logging level (DEBUG, INFO, WARNING, ERROR) | INFO |
| DEBUG | Enable debug mode | false |
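
For example, to enable local ZFS integration and serve blobs under a per-dataset public URL, the settings can be combined like this (the pool path, owner, URL, and expiry below are placeholder values, not defaults):

# create ZFS datasets under this pool path for local storage
export LAKEHOUSE_ON_ZFS=true
export LAKEHOUSE_ZFS_POOL=zpools/tank/lakehouse
export LAKEHOUSE_ZFS_OWNER=1000:1000  # chown new mountpoints to this uid:gid

# public blob URLs, with the dataset name filled into the template
export LAKEHOUSE_PUBLIC_URL_PREFIX="https://data.example.org/{{ dataset }}"
export LAKEHOUSE_ARCHIVE_URL_EXPIRE=3600  # signed archive URLs valid for 1 hour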

Basic Usage

# Local filesystem
export LAKEHOUSE_URI=./my_lakehouse

# S3 storage
export LAKEHOUSE_URI=s3://my-bucket/lakehouse
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret

# With persistent journal (for production)
export LAKEHOUSE_JOURNAL_URI=postgresql://user:pass@localhost/journal

Dataset Configuration

Each dataset can have its own config.yml file that follows the ftmq.model.Dataset specification:

name: my_dataset  # also known as "foreign_id"
title: An Awesome Dataset
description: >
  A detailed description of this dataset,
  its sources, and contents.
updated_at: 2024-09-25
category: leak  # or: sanctions, pep, etc.
publisher:
  name: Data and Research Center – DARC
  url: https://dataresearchcenter.org
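
The config.yml sits inside the dataset's directory under the lakehouse root (see the layout under Multi-Dataset Configuration below). A minimal local setup, assuming a lakehouse at ./my_lakehouse, might look like:

# create the dataset directory and put the config in place (paths are placeholders)
mkdir -p ./my_lakehouse/my_dataset
cp config.yml ./my_lakehouse/my_dataset/config.yml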

Storage Backends

Local Filesystem

export LAKEHOUSE_URI=/path/to/lakehouse

Amazon S3

export LAKEHOUSE_URI=s3://bucket-name/prefix
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_REGION=us-east-1

S3-Compatible (MinIO, etc.)

export LAKEHOUSE_URI=s3://bucket-name/prefix
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_ENDPOINT_URL=https://minlake.example.com

Google Cloud Storage

Requires extra install: pip install gcsfs

export LAKEHOUSE_URI=gs://bucket-name/prefix
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json

Azure Blob Storage

Requires extra install: pip install adlfs

export LAKEHOUSE_URI=az://container-name/prefix
export AZURE_STORAGE_ACCOUNT_NAME=your_account
export AZURE_STORAGE_ACCOUNT_KEY=your_key

Or using connection string:

export LAKEHOUSE_URI=az://container-name/prefix
export AZURE_STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net"

Or using SAS token:

export LAKEHOUSE_URI=az://container-name/prefix
export AZURE_STORAGE_ACCOUNT_NAME=your_account
export AZURE_STORAGE_SAS_TOKEN="?sv=2021-06-08&ss=b&srt=sco&sp=rwdlacyx..."

Or using Azure AD / Service Principal:

export LAKEHOUSE_URI=az://container-name/prefix
export AZURE_STORAGE_ACCOUNT_NAME=your_account
export AZURE_STORAGE_TENANT_ID=your_tenant_id
export AZURE_STORAGE_CLIENT_ID=your_client_id
export AZURE_STORAGE_CLIENT_SECRET=your_client_secret

Journal Database

The statement journal buffers writes before flushing to Delta Lake storage. For production use, configure a persistent database:

SQLite (File-based)

export LAKEHOUSE_JOURNAL_URI=sqlite:///path/to/journal.db

PostgreSQL

export LAKEHOUSE_JOURNAL_URI=postgresql://user:password@host:5432/database

In-Memory (for debugging / testing)

export LAKEHOUSE_JOURNAL_URI=sqlite:///:memory:

Warning

The in-memory journal is lost when the process exits. Use a persistent database for production workloads.

Python Configuration

You can also configure the lakehouse programmatically:

from ftm_lakehouse import get_lakehouse, get_dataset

# Get lakehouse with custom URI
lake = get_lakehouse(uri="s3://my-bucket/lakehouse")

# Get dataset
dataset = lake.get_dataset("my_dataset")

Multi-Dataset Configuration

A lakehouse can contain multiple datasets, each with different configurations:

lakehouse/
  config.yml           # Catalog-level config
  dataset_a/
    config.yml         # Dataset A config
    archive/
    ...
  dataset_b/
    config.yml         # Dataset B config (could point to remote storage)
    ...
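
If you are laying out a local lakehouse by hand, the skeleton above can be created directly on the filesystem (directory names follow the example layout):

# one directory per dataset under the lakehouse root
mkdir -p lakehouse/dataset_a lakehouse/dataset_b
# then add a catalog-level config.yml and one config.yml per dataset,
# as described under Dataset Configuration and Catalog Configuration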

A dataset can reference remote storage while appearing in a local catalog:

# lakehouse/remote_dataset/config.yml
name: remote_dataset
title: Remote Dataset
# This dataset's data lives in S3
storage:
  uri: s3://remote-bucket/dataset

Catalog Configuration

The lakehouse itself can have a config.yml:

name: my-catalog
title: My Data Catalog
description: A collection of datasets
datasets:
  - name: dataset_a
  - name: dataset_b

The catalog index.json is automatically generated from dataset metadata.