# Architecture
This document describes the layered architecture of ftm-lakehouse.
## Overview
The codebase follows a strict layered architecture with clear separation of concerns:
```
ftm_lakehouse/
├── lake.py          # Public convenience functions
├── catalog.py       # Catalog class (multi-dataset)
├── dataset.py       # Dataset class (single dataset)
│
├── model/           # Layer 1: Pure data structures
├── storage/         # Layer 2: Single-purpose storage interfaces
├── repository/      # Layer 3: Domain-specific storage combinations
├── operation/       # Layer 4: Multi-step workflows (internal)
│
└── core/            # Cross-cutting concerns
    └── conventions/ # Path and tag conventions
```
## Dependency Rules
Layers can only depend on layers below them:
```mermaid
flowchart TD
    subgraph Public["Public API"]
        API["lake.py / catalog.py / dataset.py"]
    end
    subgraph Layer4["Layer 4"]
        OP[operation]
    end
    subgraph Layer3["Layer 3"]
        REPO[repository]
    end
    subgraph Layer2["Layer 2"]
        STORE[storage]
    end
    subgraph Layer1["Layer 1"]
        MODEL[model]
    end
    CORE[core]

    API --> REPO
    API --> OP
    API --> CORE
    OP --> REPO
    OP --> CORE
    REPO --> STORE
    REPO --> CORE
    STORE --> MODEL
    STORE --> CORE
```
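In practice, these rules show up as import constraints. A minimal illustration, using module paths from the layout above (the exact import targets are illustrative):

```python
# Allowed: a repository (Layer 3) imports from storage (Layer 2) and core.
from ftm_lakehouse.storage.tags import TagStore
from ftm_lakehouse.core import settings

# Forbidden: a store (Layer 2) importing from repository (Layer 3) or
# operation (Layer 4) would invert the dependency direction.
```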
## Layer 1: Model
Pure data structures with no dependencies. Pydantic models for serialization.
```
model/
    file.py     # File, Files - archived file metadata
    mapping.py  # DatasetMapping - CSV transformation configs
    job.py      # JobModel, DatasetJobModel - job execution tracking
    crud.py     # Crud, CrudAction, CrudResource - queue action payloads
    dataset.py  # CatalogModel, DatasetModel - catalog/dataset metadata
```
Principles:
- No behavior beyond validation
- No storage awareness
- No external dependencies (except `pydantic`, `anystore.model`)
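For flavor, a minimal sketch of a Layer 1 model, assuming only `pydantic` (the class and its fields are illustrative, not the real `File` model):

```python
from datetime import datetime

from pydantic import BaseModel


class ArchivedFileExample(BaseModel):
    """Pure data + validation: no storage access, no business logic."""

    name: str
    content_hash: str
    size_bytes: int
    created_at: datetime
```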
See Model Reference for API details.
## Layer 2: Storage
Single-purpose storage interfaces. Each store does ONE thing.
```
storage/
    parquet.py   # ParquetStore, TranslogStore - Delta Lake statement + metadata
    journal.py   # JournalStore - SQL statement buffer (write-ahead log)
    tags.py      # TagStore - key-value freshness tracking
    queue.py     # QueueStore - CRUD action queue
    versions.py  # VersionStore - timestamped snapshots
```
Blob, file metadata, and text storage are handled directly by repositories using `anystore.Store` instances via `get_store()`, eliminating a layer of indirection.
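A hedged sketch of that direct usage, assuming `anystore`'s key-value `Store` interface with `put`/`get` (the URI and keys below are hypothetical):

```python
from anystore import get_store

# A repository talks to blob storage directly - no extra store class needed.
store = get_store("s3://my-bucket/my_dataset/archive")  # hypothetical URI
store.put("files/report.pdf.json", b'{"name": "report.pdf"}')
metadata = store.get("files/report.pdf.json")
```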
### Translog pattern
The main parquet table stores immutable FtM statements using the upstream `ARROW_SCHEMA` (no `deleted_at` column). A lightweight translog Delta table tracks mutable per-statement metadata:
| Column | Type | Purpose |
|---|---|---|
| `id` | string | Statement ID (primary key) |
| `first_seen` | timestamp[us] | When the statement was first written |
| `last_seen` | timestamp[us] | When the statement was last seen (updated on re-add) |
| `deleted_at` | timestamp[us] | Soft-delete marker (NULL = live) |
All queries join main + translog, filtering `deleted_at IS NULL` and using translog timestamps. This separates immutable data (statements) from mutable metadata (timestamps, deletes) and avoids writing tombstone rows into the main table.
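As an illustration, the join expressed in DuckDB SQL (a simplification that reads the Delta tables as raw parquet globs; the real translog-aware helpers live in `logic/parquet.py`):

```python
import duckdb

live_statements = duckdb.sql("""
    SELECT m.*, t.first_seen, t.last_seen
    FROM read_parquet('main/**/*.parquet') AS m
    JOIN read_parquet('translog/**/*.parquet') AS t USING (id)
    WHERE t.deleted_at IS NULL  -- hide soft-deleted statements
""")
```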
During flush, the journal is split three ways (sketched in the code below):

- New statements → append to main table + insert into translog
- Duplicate statements → update translog `last_seen` only (main table untouched)
- Tombstones → update translog `deleted_at` only (main table untouched)
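A minimal, self-contained sketch of that three-way split, using plain lists and dicts as stand-ins for the journal, main table, and translog (the real API differs):

```python
from dataclasses import dataclass


@dataclass
class JournalRow:
    id: str
    ts: str                  # event timestamp
    tombstone: bool = False  # True marks a delete


def flush(rows: list[JournalRow], main: list[JournalRow],
          translog: dict[str, dict]) -> None:
    for row in rows:
        meta = translog.get(row.id)
        if row.tombstone:
            if meta is not None:   # soft delete: translog only
                meta["deleted_at"] = row.ts
        elif meta is not None:     # duplicate: bump last_seen only
            meta["last_seen"] = row.ts
        else:                      # new: append to main + insert translog
            main.append(row)
            translog[row.id] = {
                "first_seen": row.ts, "last_seen": row.ts, "deleted_at": None,
            }
```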
Principles:
- Each store is independent - no cross-store awareness
- Operates on a single storage URI
- Returns/accepts model objects
- No business logic
See Storage Reference for API details.
## Layer 3: Repository
Domain-specific combinations of multiple stores. Each repository owns ONE domain concept.
```
repository/
    base.py       # BaseRepository - common repository interface
    archive.py    # ArchiveRepository - blobs, file metadata, text (via get_store)
    entities.py   # EntityRepository - uses JournalStore + ParquetStore
    documents.py  # DocumentRepository - compiled document metadata CSV + diffs
    mapping.py    # MappingRepository - uses VersionStore
    job.py        # JobRepository - job tracking (via get_store)
    factories.py  # Cached factory functions (get_archive, get_entities, etc.)
```
Principles:
- Combines stores for a single domain concept
- May use `get_store()` directly for simple storage needs (blobs, metadata JSON)
- No cross-domain awareness (`ArchiveRepository` doesn't know about statements)
- Provides domain-specific operations
- Uses `TagStore` for freshness tracking
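Schematically, a repository is just composition of stores for its domain. A sketch, assuming the store exports above (the real `EntityRepository` constructor may differ):

```python
from ftm_lakehouse.storage import JournalStore, ParquetStore, TagStore


class EntityRepositorySketch:
    """Owns the 'entities' domain by combining Layer 2 stores."""

    def __init__(self, journal: JournalStore, parquet: ParquetStore,
                 tags: TagStore) -> None:
        self.journal = journal  # write-ahead statement buffer
        self.parquet = parquet  # Delta Lake statement table
        self.tags = tags        # freshness tracking
```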
See Repository Reference for API details.
## Layer 4: Operation
Multi-step workflows that coordinate across repositories. This is where "action chains" are made explicit.
```
operation/
    base.py      # DatasetJobOperation - base class with freshness checks
    export.py    # ExportStatementsOperation, ExportEntitiesOperation, etc.
    crawl.py     # CrawlOperation - source → files → entities
    mapping.py   # MappingOperation - config → entities → journal
    optimize.py  # OptimizeOperation - compact Delta Lake parquet files
```
Principles:
- Operations are internal (not exposed to clients directly)
- Make multi-step processes explicit
- Handle freshness checks via the `@skip_if_latest` decorator or `ensure_flush()`
- May span multiple repositories
- Create job run records for tracking
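The shape of an operation, sketched below; only `@skip_if_latest` and `ensure_flush()` are real names from above, everything else is illustrative:

```python
class ExportOperationSketch:
    """An explicit multi-step workflow spanning repositories."""

    def run(self) -> None:
        # 1. Freshness check - the real operations use @skip_if_latest
        #    or ensure_flush() for this step.
        # 2. Flush the journal so the parquet table is current.
        # 3. Read entities from the repository and write the export artifact.
        # 4. Record a job run for tracking.
        ...
```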
See Operation Reference for API details.
## Layer 5: Public API
The public interface that clients use.
```
lake.py     # Convenience functions: get_lakehouse(), get_dataset(), get_archive(), etc.
catalog.py  # Catalog class - multi-dataset management
dataset.py  # Dataset class - single dataset interface
```
Key Classes:
- `Catalog` - Multi-dataset management: `get_dataset()`, `list_datasets()`, `create_dataset()`
- `Dataset` - Single dataset interface with repository access: `archive`, `entities`, `mappings`, `jobs`
Convenience functions in `lake.py`:
- `get_lakehouse()` - Get the catalog
- `get_dataset()` / `ensure_dataset()` - Get or create a dataset
- `get_entities()` / `get_archive()` / `get_mappings()` - Repository shortcuts
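Putting it together, a typical client session might look like this (the dataset name is hypothetical and the exact signatures are assumptions; see the Lake Reference):

```python
from ftm_lakehouse import get_lakehouse

catalog = get_lakehouse()
dataset = catalog.get_dataset("my_dataset")  # hypothetical dataset name

entities = dataset.entities  # EntityRepository
archive = dataset.archive    # ArchiveRepository
mappings = dataset.mappings  # MappingRepository
```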
See Lake Reference for API details.
## Core
Cross-cutting concerns used by all layers.
```
core/
    settings.py  # Configuration from environment (Settings, ApiSettings)
    config.py    # Config loading utilities (load_config)
    conventions/
        path.py  # Path patterns (archive/, mappings/, exports/, etc.)
        tag.py   # Tag keys (journal/last_updated, exports/statements, etc.)
```
Principles:
- No business logic
- Pure utilities and configuration
- Can be used by any layer
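The conventions modules are plain constants and small helpers, roughly of this flavor (the names mirror the path and tag examples above; the actual module contents are an assumption):

```python
# core/conventions/tag.py - tag keys (illustrative)
JOURNAL_LAST_UPDATED = "journal/last_updated"
EXPORTS_STATEMENTS = "exports/statements"


# core/conventions/path.py - path patterns (illustrative)
def archive_path(key: str) -> str:
    return f"archive/{key}"
```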
Additional Modules:
```
helpers/
    file.py           # File handling (mime_to_schema, etc.)
    statements.py     # Statement pack/unpack for journal
    dataset.py        # Dataset resource builders
    serialization.py  # Model serialization utilities
```
## Usage Examples
For detailed usage examples, see:
- Quickstart - Getting started guide
- Working with Entities - Entity/statement operations
- Working with Files - File archive operations
- Working with Mappings - CSV mapping operations
## File Layout
Complete directory structure:
```
ftm_lakehouse/
├── __init__.py          # Exports: Catalog, Dataset, get_lakehouse, etc.
├── lake.py              # get_lakehouse(), get_dataset(), ensure_dataset()
├── catalog.py           # Catalog class
├── dataset.py           # Dataset class
├── cli.py               # CLI entry point (typer-based)
├── util.py              # General utilities
├── exceptions.py
│
├── model/
│   ├── __init__.py      # Exports all models
│   ├── file.py          # File, Files
│   ├── mapping.py       # DatasetMapping
│   ├── job.py           # JobModel, DatasetJobModel
│   ├── crud.py          # Crud, CrudAction, CrudResource
│   └── dataset.py       # CatalogModel, DatasetModel
│
├── storage/
│   ├── __init__.py      # Exports all stores
│   ├── parquet.py       # ParquetStore, TranslogStore
│   ├── journal.py       # JournalStore, JournalWriter
│   ├── tags.py          # TagStore
│   ├── queue.py         # QueueStore
│   └── versions.py      # VersionStore
│
├── repository/
│   ├── __init__.py      # Exports all repositories
│   ├── base.py          # BaseRepository
│   ├── archive.py       # ArchiveRepository
│   ├── entities.py      # EntityRepository
│   ├── documents.py     # DocumentRepository
│   ├── mapping.py       # MappingRepository
│   ├── job.py           # JobRepository
│   └── factories.py     # Cached factory functions
│
├── operation/
│   ├── __init__.py      # Exports all operations
│   ├── base.py          # DatasetJobOperation
│   ├── export.py        # Export operations
│   ├── crawl.py         # CrawlOperation
│   ├── mapping.py       # MappingOperation
│   └── optimize.py      # OptimizeOperation
│
├── helpers/
│   ├── file.py          # File utilities
│   ├── statements.py    # Statement pack/unpack
│   ├── dataset.py       # Resource builders
│   └── serialization.py # Model serialization
│
├── logic/
│   ├── __init__.py
│   ├── entities.py      # Entity logic
│   ├── parquet.py       # Translog-aware DuckDB query helpers
│   └── mappings.py      # Mapping logic
│
├── api/
│   ├── __init__.py
│   ├── main.py          # FastAPI app
│   ├── auth.py          # Authentication
│   └── util.py          # API utilities
│
└── core/
    ├── __init__.py      # Exports: Settings, load_config
    ├── settings.py      # Settings, ApiSettings
    ├── config.py        # load_config()
    └── conventions/
        ├── __init__.py  # Exports: path, tag modules
        ├── path.py      # Path conventions
        └── tag.py       # Tag keys
```
## Key Principles
- Each storage does ONE thing - no cross-storage awareness
- Repositories combine storages - for ONE domain concept
- Operations are explicit workflows - no hidden side effects
- Freshness is explicit - checked at the operation layer via `@skip_if_latest` / `ensure_flush()`
- Public API is simple - delegates to repositories/operations
- `__init__.py` exports only - no logic in init files
- Strict layer dependencies - upper layers depend on lower layers only