# Configuration

ftm-lakehouse can be configured via environment variables or YAML configuration files.

## Environment Variables

### Core Settings
| Variable | Description | Default |
|---|---|---|
| `LAKEHOUSE_URI` | Base path to lakehouse storage | `./data` |
| `LAKEHOUSE_JOURNAL_URI` | SQLAlchemy URI for statement journal | `sqlite:///:memory:` |
| `LAKEHOUSE_ON_ZFS` | Enable ZFS dataset creation for local storage | `false` |
| `LAKEHOUSE_ZFS_POOL` | ZFS dataset path for the lakehouse root (e.g. `zpools/tank/lakehouse`) | (required when `ON_ZFS` is enabled) |
| `LAKEHOUSE_ZFS_SOCKET` | Unix socket path for remote ZFS operations (see ZFS Integration) | (unset) |
| `LAKEHOUSE_ZFS_OWNER` | `uid:gid` to chown new ZFS mountpoints to (see ZFS Integration) | (unset; no chown) |
| `LAKEHOUSE_PUBLIC_URL_PREFIX` | Public URL prefix for blob URLs (supports a `{{ dataset }}` Jinja-style template) | (unset) |
| `LAKEHOUSE_ARCHIVE_URL_EXPIRE` | Expiration for signed/tokenized archive URLs, in seconds | `900` (15 min) |
| `LOG_LEVEL` | Logging level (`DEBUG`, `INFO`, `WARNING`, `ERROR`) | `INFO` |
| `DEBUG` | Enable debug mode | `false` |
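For example, the public URL prefix can embed the dataset name via the `{{ dataset }}` template; the hostname below is illustrative:

```bash
# Blob URLs are rendered against this prefix, with {{ dataset }}
# replaced by the dataset name (data.example.org is a placeholder)
export LAKEHOUSE_PUBLIC_URL_PREFIX="https://data.example.org/{{ dataset }}"
```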
### Basic Usage

```bash
# Local filesystem
export LAKEHOUSE_URI=./my_lakehouse

# S3 storage
export LAKEHOUSE_URI=s3://my-bucket/lakehouse
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret

# With persistent journal (for production)
export LAKEHOUSE_JOURNAL_URI=postgresql://user:pass@localhost/journal
```
## Dataset Configuration

Each dataset can have its own `config.yml` file that follows the `ftmq.model.Dataset` specification:

```yaml
name: my_dataset # also known as "foreign_id"
title: An Awesome Dataset
description: >
  A detailed description of this dataset,
  its sources, and contents.
updated_at: 2024-09-25
category: leak # or: sanctions, pep, etc.
publisher:
  name: Data and Research Center – DARC
  url: https://dataresearchcenter.org
```
## Storage Backends

### Local Filesystem
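The default backend; any local directory works. A minimal example (the path below is illustrative):

```bash
# Store the lakehouse in a local directory (relative or absolute path)
export LAKEHOUSE_URI=/var/lib/lakehouse
```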
### Amazon S3

```bash
export LAKEHOUSE_URI=s3://bucket-name/prefix
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_REGION=us-east-1
```
### S3-Compatible (MinIO, etc.)

```bash
export LAKEHOUSE_URI=s3://bucket-name/prefix
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_ENDPOINT_URL=https://minlake.example.com
```
### Google Cloud Storage

Requires extra install: `pip install gcsfs`

```bash
export LAKEHOUSE_URI=gs://bucket-name/prefix
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
```
### Azure Blob Storage

Requires extra install: `pip install adlfs`

```bash
export LAKEHOUSE_URI=az://container-name/prefix
export AZURE_STORAGE_ACCOUNT_NAME=your_account
export AZURE_STORAGE_ACCOUNT_KEY=your_key
```

Or using a connection string:

```bash
export LAKEHOUSE_URI=az://container-name/prefix
export AZURE_STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net"
```

Or using a SAS token:

```bash
export LAKEHOUSE_URI=az://container-name/prefix
export AZURE_STORAGE_ACCOUNT_NAME=your_account
export AZURE_STORAGE_SAS_TOKEN="?sv=2021-06-08&ss=b&srt=sco&sp=rwdlacyx..."
```

Or using Azure AD / a Service Principal:

```bash
export LAKEHOUSE_URI=az://container-name/prefix
export AZURE_STORAGE_ACCOUNT_NAME=your_account
export AZURE_STORAGE_TENANT_ID=your_tenant_id
export AZURE_STORAGE_CLIENT_ID=your_client_id
export AZURE_STORAGE_CLIENT_SECRET=your_client_secret
```
## Journal Database

The statement journal buffers writes before flushing them to Delta Lake storage. For production use, configure a persistent database:
### SQLite (File-based)
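A file-backed SQLite journal survives restarts. Any SQLAlchemy SQLite URI works; the filename below is illustrative:

```bash
# File-backed SQLite journal (three slashes = relative path,
# four slashes = absolute path)
export LAKEHOUSE_JOURNAL_URI=sqlite:///journal.db
```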
### PostgreSQL
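The same PostgreSQL URI shown under Basic Usage applies here:

```bash
export LAKEHOUSE_JOURNAL_URI=postgresql://user:pass@localhost/journal
```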
### In-Memory (for debugging / testing)
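This is the default (see Core Settings above):

```bash
export LAKEHOUSE_JOURNAL_URI=sqlite:///:memory:
```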
> **Warning:** The in-memory journal is lost when the process exits. Use a persistent database for production workloads.
## Python Configuration

You can also configure programmatically:

```python
from ftm_lakehouse import get_lakehouse

# Get lakehouse with custom URI
lake = get_lakehouse(uri="s3://my-bucket/lakehouse")

# Get dataset
dataset = lake.get_dataset("my_dataset")
```
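If `uri` is omitted, `get_lakehouse()` should fall back to the `LAKEHOUSE_URI` environment variable described above.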
## Multi-Dataset Configuration

A lakehouse can contain multiple datasets, each with its own configuration:

```
lakehouse/
  config.yml            # Catalog-level config
  dataset_a/
    config.yml          # Dataset A config
    archive/
    ...
  dataset_b/
    config.yml          # Dataset B config (could point to remote storage)
    ...
```
A dataset can reference remote storage while appearing in a local catalog:

```yaml
# lakehouse/remote_dataset/config.yml
name: remote_dataset
title: Remote Dataset

# This dataset's data lives in S3
storage:
  uri: s3://remote-bucket/dataset
```
## Catalog Configuration

The lakehouse itself can have a `config.yml`:

```yaml
name: my-catalog
title: My Data Catalog
description: A collection of datasets
datasets:
  - name: dataset_a
  - name: dataset_b
```

The catalog `index.json` is automatically generated from dataset metadata.