# Aleph

Sync an ftm-datalake dataset into an Aleph instance. This uses alephclient, so the configured `ALEPHCLIENT_API_KEY` needs to have the appropriate permissions.

Collections will be created if they don't exist, and their metadata will be updated (this can be disabled via `--no-metadata`). The Aleph collection's foreign ID can be set via `--foreign-id` and defaults to the ftm-datalake dataset name.
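For example, to sync into a collection with a custom foreign ID without updating its metadata (the foreign ID value here is illustrative):

```shell
ftm-datalake -d my_dataset aleph sync --foreign-id my-custom-id --no-metadata
```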

When the global cache is enabled (environment `CACHE=1`, the default), only new documents are synced. The cache handles multiple Aleph instances and keeps track of the sync status for each of them individually.
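For example, to bypass the cache and re-sync all documents in a single run (assuming `CACHE=0` disables the global cache described above):

```shell
CACHE=0 ftm-datalake -d my_dataset aleph sync
```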

The Aleph API configuration can also be set via the command line:

```shell
ftm-datalake -d my_dataset aleph sync --host <host> --api-key <api-key>
```

Sync documents into a subfolder that will be created if it doesn't exist:

```shell
ftm-datalake -d my_dataset aleph sync --folder "Documents/Court cases"
```

## Reference

Sync Aleph collections into ftm_datalake or vice versa via alephclient

`sync_to_aleph(dataset, host, api_key, prefix=None, foreign_id=None, metadata=True)`

Incrementally sync an ftm_datalake dataset into an Aleph instance.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `dataset` | `DatasetArchive` | ftm_datalake Dataset instance | *required* |
| `host` | `str \| None` | Aleph host (can be set via env `ALEPHCLIENT_HOST`) | *required* |
| `api_key` | `str \| None` | Aleph API key (can be set via env `ALEPHCLIENT_API_KEY`) | *required* |
| `prefix` | `str \| None` | Add a folder prefix to import documents into | `None` |
| `foreign_id` | `str \| None` | Aleph collection `foreign_id` (if different from the ftm_datalake dataset name) | `None` |
| `metadata` | `bool \| None` | Update Aleph collection metadata | `True` |
Source code in `ftm_datalake/sync/aleph.py`:
```python
def sync_to_aleph(
    dataset: DatasetArchive,
    host: str | None,
    api_key: str | None,
    prefix: str | None = None,
    foreign_id: str | None = None,
    metadata: bool | None = True,
) -> AlephUploadStatus:
    """
    Incrementally sync a ftm_datalake dataset into an Aleph instance.

    Args:
        dataset: ftm_datalake Dataset instance
        host: Aleph host (can be set via env `ALEPHCLIENT_HOST`)
        api_key: Aleph api key (can be set via env `ALEPHCLIENT_API_KEY`)
        prefix: Add a folder prefix to import documents into
        foreign_id: Aleph collection foreign_id (if different from ftm_datalake dataset name)
        metadata: Update Aleph collection metadata
    """
    worker = AlephUploadWorker(
        dataset=dataset,
        host=host,
        api_key=api_key,
        prefix=prefix,
        foreign_id=foreign_id,
        metadata=metadata,
    )
    worker.log_info(f"Starting sync to Aleph `{worker.host}` ...")
    return worker.run()
```