logic

The logic module contains pure, stateless transformation functions with no infrastructure dependencies. Functions here take inputs and produce outputs without side effects.

Entity Aggregation

Aggregate a stream of statement dicts into FollowTheMoney entity dicts:

from ftm_lakehouse.logic import aggregate_unsafe

for entity in aggregate_unsafe(statement_dicts, "my_dataset"):
    print(f"{entity['id']}: {entity['caption']}")

aggregate_unsafe assumes the input is pre-sorted by entity_id – the parquet store guarantees this for its queries.

`ftm_lakehouse.logic.aggregate_unsafe(data, dataset=None)`

Aggregate statement dicts (e.g. from DuckDB rows) to entity payloads.

Completely circumvents the dict -> Statement -> StatementEntity -> dict Python path, and therefore performs no validation. Input must be sorted by entity_id: this store never resolves entities, so canonical_id == entity_id, and the ftmq entity query, which orders by canonical_id, produces the same ordering via the entity_id AS canonical_id view alias.

Source code in ftm_lakehouse/logic/entities/aggregate.py

def aggregate_unsafe(
    data: Iterator[StatementDict], dataset: str | None = None
) -> Iterator[EntityPayload]:
    """
    Aggregate statement dicts (e.g. from DuckDB rows) to entity payloads.

    Completely circumvents the dict -> Statement -> StatementEntity -> dict
    Python path, but therefore has no validation checks. Input must be sorted
    by entity_id (this store never resolves, so canonical_id == entity_id;
    the ftmq entity query still orders by canonical_id, which is the same
    ordering via the ``entity_id AS canonical_id`` view alias).
    """
    current: EntityPayload | None = None
    for statement in data:
        if current is None or statement["entity_id"] != current.id:
            if current is not None:
                yield current
            current = EntityPayload(id=statement["entity_id"], dataset=dataset)
        current.add(statement)
    if current is not None:
        yield current

Mapping Processing

Generate entities from FollowTheMoney mapping configurations:

from ftm_lakehouse.logic import map_entities
from ftm_lakehouse.model.mapping import DatasetMapping

mapping = DatasetMapping(
    dataset="my_dataset",
    content_hash="abc123...",
    queries=[...]
)

for entity in map_entities(mapping, csv_path):
    print(f"{entity.schema.name}: {entity.caption}")

`ftm_lakehouse.logic.map_entities(mapping, csv_path)`

Generate entities from a mapping configuration and source file.

Applies a FollowTheMoney mapping configuration to a CSV/tabular file and yields the resulting entities. Each entity is annotated with:

- A proof property linking to the source file's content hash
- An origin context identifying the mapping source

This function is the core transformation logic used by DatasetMappings.process(). It handles the iteration over mapping queries and record processing.

Parameters:

Name	Type	Description	Default
`mapping`	`DatasetMapping`	The mapping configuration containing query definitions	required
`csv_path`	`Path`	Local path to the source CSV/tabular file	required

Yields:

Type	Description
`Entities`	EntityProxy objects generated from the mapping

Example

from ftm_lakehouse.logic import map_entities
from ftm_lakehouse.model.mapping import DatasetMapping

mapping = DatasetMapping(
    dataset="my_dataset",
    content_hash="abc123...",
    queries=[...]  # FollowTheMoney mapping queries
)

for entity in map_entities(mapping, csv_path):
    print(f"{entity.schema.name}: {entity.caption}")

Parquet helpers

This module provides the DuckDB config, the statement / statement_raw view-SQL builders, and the merge-query builder that ParquetStore uses via ftmq's LakeStore.

`ftm_lakehouse.logic.parquet.duckdb_config()`

LakeStore DuckDB config derived from lakehouse settings.

Per-query memory is bounded by `Settings.duckdb_memory_limit` (env: `LAKEHOUSE_DUCKDB_MEMORY_LIMIT`, default `8GB`); queries exceeding the limit spill to `Settings.duckdb_temp_directory` (env: `LAKEHOUSE_DUCKDB_TEMP_DIRECTORY`) when set, otherwise to the OS temp directory DuckDB picks by default. The result is passed to `ftmq.store.lake.LakeStore` via the `duckdb_config` kwarg.

Source code in ftm_lakehouse/logic/parquet.py

def duckdb_config() -> dict[str, str]:
    """LakeStore DuckDB config derived from lakehouse settings.

    Per-query memory is bounded by :attr:`Settings.duckdb_memory_limit`
    (env: ``LAKEHOUSE_DUCKDB_MEMORY_LIMIT``, default ``8GB``); queries
    exceeding the limit spill to :attr:`Settings.duckdb_temp_directory`
    (env: ``LAKEHOUSE_DUCKDB_TEMP_DIRECTORY``) when set, otherwise to
    the OS temp directory DuckDB picks by default. Passed to
    :class:`~ftmq.store.lake.LakeStore` via the ``duckdb_config`` kwarg.
    """
    settings = Settings()
    config: dict[str, str] = {"memory_limit": settings.duckdb_memory_limit}
    if settings.duckdb_temp_directory:
        config["temp_directory"] = settings.duckdb_temp_directory
    return config
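
The env-to-config mapping described above can be sketched without the library. `sketch_duckdb_config` is a hypothetical helper that reads the two documented environment variable names from a plain mapping instead of the real `Settings` class, so treat it as an illustration of the resulting dict shape only.

```python
def sketch_duckdb_config(env: dict[str, str]) -> dict[str, str]:
    """Hypothetical stand-in: derive the DuckDB config from an env-like mapping."""
    # The 8GB default is the one documented for Settings.duckdb_memory_limit.
    config = {"memory_limit": env.get("LAKEHOUSE_DUCKDB_MEMORY_LIMIT", "8GB")}
    temp_dir = env.get("LAKEHOUSE_DUCKDB_TEMP_DIRECTORY")
    if temp_dir:
        # Only set when configured; otherwise DuckDB picks its own temp directory.
        config["temp_directory"] = temp_dir
    return config


default_config = sketch_duckdb_config({})
tuned_config = sketch_duckdb_config(
    {
        "LAKEHOUSE_DUCKDB_MEMORY_LIMIT": "2GB",
        "LAKEHOUSE_DUCKDB_TEMP_DIRECTORY": "/tmp/duckdb-spill",  # example path
    }
)
```

Leaving `temp_directory` out of the dict, rather than setting it to an empty string, keeps DuckDB's own default in effect.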

`ftm_lakehouse.logic.parquet.raw_view_sql(dt)`

SELECT body for the statement_raw view.

Surfaces every physical row in the Delta table, including tombstones and pre-merge duplicates. Used by `build_merge_sql` and `get_changed_entity_ids` – any path that needs the physical layout visible.

Source code in ftm_lakehouse/logic/parquet.py

def raw_view_sql(dt: DeltaTable) -> str:
    """SELECT body for the ``statement_raw`` view.

    Surfaces every physical row in the Delta table, including
    tombstones and pre-merge duplicates. Used by :func:`build_merge_sql`
    and :meth:`get_changed_entity_ids` – any path that needs the
    physical layout visible.
    """
    return f"SELECT * FROM {_delta_scan_clause(dt)}"

`ftm_lakehouse.logic.parquet.live_view_sql(dt)`

SELECT body for the live statement view.

On a store kept canonical by `build_merge_sql` (one row per statement id, fragment supersession applied, `first_seen` / `last_seen` folded) the live rows are simply the non-tombstoned physical rows – so the view is a plain filtered scan, no window function. Predicate pushdown works natively: `schema` / `prop` / `entity_id` filters reach `delta_scan`'s per-file statistics (a window would be a pushdown barrier for any non-partition column).

`canonical_id` is not stored – this is a single-dataset store with no entity resolution, so it always equals `entity_id` – and is synthesised here as `entity_id AS canonical_id` so ftmq's query layer (which keys entity identity on `canonical_id`) resolves against the view unchanged. `raw_view_sql` deliberately omits it so `merge` never materialises the duplicate column.

Correctness holds only on an optimized store: between a write and the next `merge` this view can surface duplicate ids and rows whose delete has not been applied yet. Run `optimize` before querying – the dedupe / supersession / grace logic lives solely in `build_merge_sql`.

Source code in ftm_lakehouse/logic/parquet.py

def live_view_sql(dt: DeltaTable) -> str:
    """SELECT body for the live ``statement`` view.

    On a store kept canonical by :func:`build_merge_sql` (one row per
    statement id, fragment supersession applied, ``first_seen`` /
    ``last_seen`` folded) the live rows are simply the non-tombstoned
    physical rows – so the view is a plain filtered scan, no window
    function. Predicate pushdown works natively: ``schema`` / ``prop`` /
    ``entity_id`` filters reach ``delta_scan``'s per-file statistics (a
    window would be a pushdown barrier for any non-partition column).

    ``canonical_id`` is not stored – this is a single-dataset store with no
    entity resolution, so it always equals ``entity_id`` – and is synthesised
    here as ``entity_id AS canonical_id`` so ftmq's query layer (which keys
    entity identity on ``canonical_id``) resolves against the view unchanged.
    :func:`raw_view_sql` deliberately omits it so ``merge`` never materialises
    the duplicate column.

    Correctness holds only on an **optimized** store: between a write and
    the next :meth:`merge` this view can surface duplicate ids and rows
    whose delete has not been applied yet. Run ``optimize`` before
    querying – the dedupe / supersession / grace logic lives solely in
    :func:`build_merge_sql`.
    """
    return (
        f"SELECT *, entity_id AS canonical_id "
        f"FROM {_delta_scan_clause(dt)} WHERE deleted_at IS NULL"
    )

Both builders emit delta_scan('<uri>'), so a view defined from this SQL resolves the current Delta log on every query – defining it once per connection is enough; subsequent write_deltalake commits are picked up automatically. The live statement view is a plain WHERE deleted_at IS NULL scan (no window function, so predicate pushdown survives) and is only correct on an optimized store; statement_raw exposes every physical row – tombstones and pre-merge duplicates included – for merge and get_changed_entity_ids.
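
The difference between the two views can be tried out with the standard library. In this sketch an in-memory SQLite table stands in for the `delta_scan('<uri>')` source (an assumption made purely so the example runs without DuckDB or a Delta table); the SELECT body of the live view has the same shape as the one `live_view_sql` returns.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Stand-in for the physical rows that statement_raw exposes.
con.execute(
    "CREATE TABLE statement_raw "
    "(id TEXT, entity_id TEXT, prop TEXT, value TEXT, deleted_at TEXT)"
)
con.executemany(
    "INSERT INTO statement_raw VALUES (?, ?, ?, ?, ?)",
    [
        ("s1", "a", "name", "Alice", None),
        ("s2", "b", "name", "Bob", "2024-01-05T00:00:00+00:00"),  # tombstone
    ],
)
# Live view: plain filtered scan plus the synthesised canonical_id column.
con.execute(
    "CREATE VIEW statement AS "
    "SELECT *, entity_id AS canonical_id FROM statement_raw "
    "WHERE deleted_at IS NULL"
)

raw_ids = [row[0] for row in con.execute("SELECT id FROM statement_raw ORDER BY id")]
live_rows = con.execute("SELECT id, canonical_id FROM statement ORDER BY id").fetchall()
```

The raw source still returns the tombstoned row, while the live view hides it and exposes `canonical_id` as a copy of `entity_id`.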

`ftm_lakehouse.logic.parquet.build_merge_sql(shard, bucket, origin, grace_cutoff)`

DuckDB SQL that collapses one partition for physical merge.

Applies `_dedupe_sql` to the raw `statement_raw` view (not the deduped `statement` view), scoped to one `(shard, bucket, origin)` partition. The raw view is required because `merge` needs every row visible, including tombstones within the grace window, which must persist physically to keep shadowing their live rows. Output is ordered by `(entity_id, fragment, prop, id, last_seen DESC)` – the file sort key – so the rewritten parquet file is ready for future merges without re-sort.

Parameters:

Name	Type	Description	Default
`shard`	`str`	Target shard value (hex-padded).	required
`bucket`	`str`	Target bucket (`thing` / `interval` / `document` / `page` / `pages` / `mention`).	required
`origin`	`str`	Target origin tag – validated at the write boundary; single quotes are doubled here as defense in depth.	required
`grace_cutoff`	`datetime`	Tombstones with `deleted_at <= grace_cutoff` are dropped. Typically `now - LAKEHOUSE_GRACE_PERIOD_DAYS`.	required

Returns:

Type	Description
`str`	Executable DuckDB SQL.

Source code in ftm_lakehouse/logic/parquet.py

def build_merge_sql(
    shard: str,
    bucket: str,
    origin: str,
    grace_cutoff: datetime,
) -> str:
    """DuckDB SQL that collapses one partition for physical merge.

    :func:`_dedupe_sql` over the **raw** ``statement_raw`` view (not the
    deduped ``statement``) because ``merge`` needs every row visible –
    including tombstones within the grace window, which must persist
    physically to keep shadowing their live rows – scoped to one
    ``(shard, bucket, origin)`` partition. Output is ordered by
    ``(entity_id, fragment, prop, id, last_seen DESC)`` – the file sort
    key – so the rewritten parquet file is ready for future merges
    without re-sort.

    Args:
        shard: Target shard value (hex-padded).
        bucket: Target bucket (``thing`` / ``interval`` / ``document`` /
            ``page`` / ``pages`` / ``mention``).
        origin: Target origin tag – validated at the write boundary;
            single quotes are doubled here as defense in depth.
        grace_cutoff: Tombstones with ``deleted_at <= grace_cutoff`` are
            dropped. Typically ``now - LAKEHOUSE_GRACE_PERIOD_DAYS``.

    Returns:
        Executable DuckDB SQL.
    """
    origin = validate_origin(origin)
    return _dedupe_sql(
        source=TABLE_RAW.name,
        where=(
            f"WHERE shard = '{shard}' AND bucket = '{bucket}' "
            f"AND origin = '{origin}'"
        ),
        tombstone=(
            "(deleted_at IS NULL OR deleted_at > "
            f"TIMESTAMPTZ '{grace_cutoff.isoformat()}')"
        ),
        order_by="ORDER BY entity_id, fragment, prop, id, last_seen DESC",
    )

`ftm_lakehouse.logic.parquet.build_changed_sql(shard, bucket, since)`

DuckDB SQL for the canonical live rows of entities changed since since.

Applies `_dedupe_sql` to the raw `statement_raw` view, scoped to one `(shard, bucket)` partition and semi-joined to the entities that have a statement whose `first_seen` or `deleted_at` is at or after `since` (the comparison is `>=`). The result therefore matches what a post-merge read would return for those entities without requiring a merge first: supersession applied, tombstones shadowing their live rows and then filtered by the default `deleted_at IS NULL` predicate. A fully deleted entity yields zero rows, which is what lets the diff exporter emit a `DEL` op on an un-merged store. The slice deliberately spans all origins of the partition: `_dedupe_sql` keys both branches on `origin`, so per-origin rows stay isolated exactly as physical merge would leave them.

Parameters:

Name	Type	Description	Default
`shard`	`str`	Target shard value (hex-padded).	required
`bucket`	`str`	Target bucket.	required
`since`	`datetime`	Change watermark; compared against `first_seen` and `deleted_at`.	required

Returns:

Type	Description
`str`	Executable DuckDB SQL, ordered by `entity_id` so each entity's rows stream contiguously into aggregation.

Source code in ftm_lakehouse/logic/parquet.py

def build_changed_sql(shard: str, bucket: str, since: datetime) -> str:
    """DuckDB SQL for the canonical live rows of entities changed since ``since``.

    :func:`_dedupe_sql` over the raw ``statement_raw`` view, scoped to one
    ``(shard, bucket)`` partition and semi-joined to the entities with a
    statement whose ``first_seen`` or ``deleted_at`` is newer than
    ``since`` – so the result matches what a post-merge read would return
    for those entities *without* requiring a merge first: supersession
    applied, tombstones shadowing their live rows and then filtered by
    the default ``deleted_at IS NULL`` predicate. A fully deleted entity
    therefore yields **zero** rows, which is what lets the diff exporter
    emit a ``DEL`` op on an un-merged store. The slice deliberately spans
    all origins of the partition – ``_dedupe_sql`` keys both branches on
    ``origin``, so per-origin rows stay isolated exactly as physical
    merge would leave them.

    Args:
        shard: Target shard value (hex-padded).
        bucket: Target bucket.
        since: Change watermark; compared against ``first_seen`` and
            ``deleted_at``.

    Returns:
        Executable DuckDB SQL, ordered by ``entity_id`` so each entity's
        rows stream contiguously into aggregation.
    """
    ts = f"TIMESTAMPTZ '{since.isoformat()}'"
    return _dedupe_sql(
        source=TABLE_RAW.name,
        where=(
            f"WHERE shard = '{shard}' AND bucket = '{bucket}' AND entity_id IN ("
            f"SELECT DISTINCT entity_id FROM {TABLE_RAW.name} "
            f"WHERE shard = '{shard}' AND bucket = '{bucket}' "
            f"AND (first_seen >= {ts} OR deleted_at >= {ts}))"
        ),
        order_by="ORDER BY entity_id",
    )

Both are executable DuckDB SQL strings over statement_raw, sharing the dedupe / fragment-supersession logic: build_merge_sql collapses one (shard, bucket, origin) partition for physical rewrite, build_changed_sql returns the canonical live rows of entities changed since a watermark without requiring a merge first.
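
The change-detection semantics can be modelled in a few lines of plain Python. This is a deliberately simplified sketch: it assumes one row per statement with `deleted_at` stored on the row itself, and it skips dedupe and fragment supersession entirely. `changed_live_rows` is a hypothetical name, not part of the module.

```python
from datetime import datetime, timezone


def ts(day: int) -> datetime:
    """Helper for readable UTC timestamps in January 2024."""
    return datetime(2024, 1, day, tzinfo=timezone.utc)


def changed_live_rows(rows: list[dict], since: datetime) -> tuple[set[str], list[dict]]:
    """Return (changed entity ids, their live rows ordered by entity_id)."""
    changed = {
        r["entity_id"]
        for r in rows
        if r["first_seen"] >= since
        or (r["deleted_at"] is not None and r["deleted_at"] >= since)
    }
    live = [r for r in rows if r["entity_id"] in changed and r["deleted_at"] is None]
    return changed, sorted(live, key=lambda r: r["entity_id"])


rows = [
    {"id": "s1", "entity_id": "a", "first_seen": ts(1), "deleted_at": None},   # unchanged
    {"id": "s2", "entity_id": "b", "first_seen": ts(8), "deleted_at": None},   # newly seen
    {"id": "s3", "entity_id": "c", "first_seen": ts(1), "deleted_at": ts(9)},  # fully deleted
]

changed, live = changed_live_rows(rows, since=ts(5))
# Changed entities with no live rows are the ones a diff exporter would delete.
deleted = changed - {r["entity_id"] for r in live}
```

Entity `c` counts as changed but contributes no live rows, which is the zero-row case described above for emitting a `DEL` op.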

Statement Serialization

Pack and unpack statements for compact storage in the journal data column:

from ftm_lakehouse.logic import pack_statement, unpack_statement

packed = pack_statement(stmt)     # unit-separator delimited string
stmt   = unpack_statement(packed) # back to Statement

`ftm_lakehouse.helpers.statements.pack_statement(stmt)`

Pack a Statement into a unit-separator delimited string.

Format: id, entity_id, prop, schema, value, dataset, lang, original_value, external, first_seen, last_seen, origin, prop_type

canonical_id is not serialised – this store never resolves entities, so `unpack_statement` lets FtM default it to entity_id.

Source code in ftm_lakehouse/helpers/statements.py

def pack_statement(stmt: Statement) -> str:
    """
    Pack a Statement into a unit-separator delimited string.

    Format: id, entity_id, prop, schema, value, dataset, lang,
            original_value, external, first_seen, last_seen, origin, prop_type

    ``canonical_id`` is not serialised – this store never resolves entities,
    so :func:`unpack_statement` lets FtM default it to ``entity_id``.
    """
    parts = [
        stmt.id or "",
        stmt.entity_id,
        stmt.prop,
        stmt.schema,
        stmt.value,
        stmt.dataset,
        stmt.lang or "",
        stmt.original_value or "",
        "1" if stmt.external else "0",
        _to_iso(stmt.first_seen),
        _to_iso(stmt.last_seen),
        stmt.origin or DEFAULT_ORIGIN,
        stmt.prop_type or "",
    ]
    return UNIT_SEP.join(parts)

`ftm_lakehouse.helpers.statements.unpack_statement(data)`

Unpack a unit-separator delimited string back into a Statement.

Raises:

Type	Description
`MalformedStatementError`	If `data` has fewer than `UNPACK_MIN_FIELDS` separator-delimited fields. The journal flush loop catches this, logs the row and skips it, so one bad row can't abort an entire flush.

Source code in ftm_lakehouse/helpers/statements.py

def unpack_statement(data: str) -> Statement:
    """Unpack a unit-separator delimited string back into a Statement.

    Raises:
        MalformedStatementError: If ``data`` has fewer than
            :data:`UNPACK_MIN_FIELDS` separator-delimited fields. The
            journal flush loop catches this and logs+skips the row so
            one bad row can't abort an entire flush.
    """
    parts = data.split(UNIT_SEP)
    if len(parts) < UNPACK_MIN_FIELDS:
        raise MalformedStatementError(
            f"Packed statement has {len(parts)} fields; "
            f"expected at least {UNPACK_MIN_FIELDS}"
        )
    return Statement(
        id=parts[0] or None,
        entity_id=parts[1],  # required
        prop=parts[2],  # required
        schema=parts[3],  # required
        value=parts[4],  # required
        dataset=parts[5],  # required
        lang=parts[6] or None,
        original_value=parts[7] or None,
        external=parts[8] == "1",
        first_seen=parts[9] or None,
        last_seen=parts[10] or None,
        origin=parts[11] or None,
    )
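
The packing scheme itself is easy to try without the library. The sketch below round-trips a list of string fields; `UNIT_SEP = "\x1f"` (the ASCII unit separator) and `MIN_FIELDS = 12` are assumed values for illustration, and `pack` / `unpack` are simplified stand-ins that work on plain lists rather than `Statement` objects.

```python
UNIT_SEP = "\x1f"  # ASCII unit separator; assumed value of the library constant
MIN_FIELDS = 12    # assumed minimum: the listing above reads indices 0..11


def pack(fields: list[str]) -> str:
    """Join already-stringified fields with the unit separator."""
    return UNIT_SEP.join(fields)


def unpack(data: str) -> list[str]:
    """Split a packed string, rejecting rows with too few fields."""
    parts = data.split(UNIT_SEP)
    if len(parts) < MIN_FIELDS:
        raise ValueError(
            f"Packed statement has {len(parts)} fields; expected at least {MIN_FIELDS}"
        )
    return parts


fields = [
    "stmt-1", "entity-a", "name", "Person", "Alice", "my_dataset",
    "",   # lang: an empty string stands for None
    "",   # original_value
    "0",  # external flag
    "2024-01-01T00:00:00", "2024-01-02T00:00:00",
    "default",  # origin
    "name",     # prop_type
]
packed = pack(fields)
restored = unpack(packed)

try:
    unpack(pack(["too", "short"]))
    rejected = False
except ValueError:
    rejected = True
```

Because optional fields are written as empty strings, the reader has to map `""` back to `None`, which is what the `or None` expressions in the listing above do.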