
ftm-lakehouse

ftm-lakehouse provides a data standard and archive storage for leaked data, private and public document collections, and structured FollowTheMoney data. The concepts and implementation were originally inspired by mmmeta, Aleph's servicelayer archive, and OpenSanctions' work on dataset catalog metadata.

ftm-lakehouse acts as a multi-tenant storage and retrieval mechanism for structured entity data, documents and their metadata. It provides a high-level interface for generating and sharing document collections and importing them into various search and analysis platforms, such as OpenAleph, ICIJ Datashare or Liquid Investigations.

Read the specification

What is a lakehouse?

Open formats

Given the convention-based file structure and the use of parquet files, the storage layer can be populated and consumed by third-party tools, which makes it easy to integrate ftm-lakehouse into other analytics systems or data platforms.

The complete data lakehouse, including change history and versions, is stored in a file-like storage backend. It does not rely on any other running services (such as a database), which keeps maintenance, scalability, and data consistency simple. (At runtime, a SQL database is needed for task management and a write-ahead journal.)
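Because the statement store is a Delta table backed by parquet files (see below), third-party tooling can read it directly. The following is a minimal sketch using the deltalake package; the table path is a hypothetical example of the convention-based layout:

from deltalake import DeltaTable

# Read the statement store directly with third-party tooling.
# The path below is a hypothetical example; the actual location
# follows the lakehouse's convention-based file structure.
table = DeltaTable("./lakehouse/my_dataset/statements")
df = table.to_pandas()  # statements as a pandas DataFrame
print(df.head())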

Core Components

ftm-lakehouse organizes data around two main components:

Entities

The entities interface is the primary way to work with FollowTheMoney data. It provides:

  • Writing entities to a buffered journal for efficient batch processing
  • Querying entities from a Delta Lake-based statement store
  • Exporting to various formats (JSON, CSV, statistics)

Info

See below for the archive layer that stores source files. As per the FollowTheMoney specification, files are converted into entities too, and are therefore part of the entity store as well.

Entities are stored as statements - granular property-level records that enable versioning, provenance tracking, and incremental updates.

A statement represents a single fact: one property value for one entity from one source. Each statement contains an entity_id, schema (entity type), prop (property name), value, and dataset identifier. This decomposition allows tracking where each piece of information originated - which source file, processing step, or import batch contributed a specific value. The canonical_id field enables entity deduplication by linking multiple source entities that represent the same real-world thing.

This statement-based storage model makes it possible to merge data from multiple sources while preserving full provenance, perform incremental updates without reprocessing entire datasets, and use standard file-based tools (sorting, filtering) rather than requiring database infrastructure.
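For illustration, a single person entity with two properties might decompose into statements like the following. The field names follow the description above; the exact on-disk schema is an assumption here:

# Illustrative only: field names follow the description above,
# the exact statement schema used by ftm-lakehouse may differ.
statements = [
    {
        "entity_id": "person-001",
        "canonical_id": "person-001",  # links duplicate source entities
        "schema": "Person",
        "prop": "name",
        "value": "Jane Doe",
        "dataset": "my_dataset",
    },
    {
        "entity_id": "person-001",
        "canonical_id": "person-001",
        "schema": "Person",
        "prop": "birthDate",
        "value": "1970-01-01",
        "dataset": "my_dataset",
    },
]

Reading and writing entities goes through the io module: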

from ftm_lakehouse import io

# Write entities
io.write_entities("my_dataset", entities, origin="import")

# Read an entity
entity = io.get_entity("my_dataset", "entity-id-123")

# Query entities
for entity in io.iterate_entities("my_dataset", origin="crawl"):
    process(entity)

Archive

The archive interface manages source documents and files:

  • Store files with content-addressable storage (SHA1 checksums)
  • Retrieve files by checksum or iterate through all files
  • Track metadata including MIME types, sizes, and custom properties

Files are automatically deduplicated across the archive.

from ftm_lakehouse import io

# Archive a file
file = io.archive_file("my_dataset", "/path/to/document.pdf")

# Retrieve file content
with io.open_file("my_dataset", file.checksum) as fh:
    content = fh.read()
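Since storage is content-addressed by SHA1, archiving the same bytes twice yields the same checksum instead of a second copy. A quick sketch, assuming both paths point to identical file content:

from ftm_lakehouse import io

# Assuming both paths contain identical bytes: content-addressed
# storage returns the same SHA1 checksum, so the file is stored once.
first = io.archive_file("my_dataset", "/path/to/document.pdf")
second = io.archive_file("my_dataset", "/path/to/copy-of-document.pdf")
assert first.checksum == second.checksum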

Installation

Requires Python 3.11 or later.

pip install ftm-lakehouse

Quickstart

>> Get started here

Development

This package uses poetry for packaging and dependency management, so install it first.

Clone this repository to a local destination.

Within the repo directory, run:

poetry install --with dev

This installs development dependencies, including pre-commit which needs to be registered:

poetry run pre-commit install

Before each commit, this checks code formatting (isort, black) and runs other useful checks (see: .pre-commit-config.yaml).

Testing

ftm-lakehouse uses pytest as the testing framework.

make test

ftm-lakehouse, (c) 2024 investigativedata.io

ftm-lakehouse, (c) 2025 Data and Research Center - DARC

ftm-lakehouse is licensed under the AGPLv3 or later license.