Services

The OpenAleph stack consists of several services for data storage, search indexing, and application logic.

PostgreSQL and Elasticsearch

Though the example docker-compose.yml lists PostgreSQL and Elasticsearch as Docker services, we recommend running them outside Docker ("bare metal") as services on separate machines.

Source documents

This component is referred to as Archive in various documentation sections and in the codebase. Currently, the Archive can be one of:

Local filesystem
S3-compatible blob storage
Google Cloud Storage

In the future (OpenAleph 6), more backends might be supported, for instance Azure Blob Storage.

OpenAleph stores source files under their SHA-1 checksums, using the key prefix format aa/bb/cc/aabbcc123456.../data
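The key layout above can be sketched in Python. This is a minimal illustration, assuming the three prefix segments are the first six hex characters of the digest split into pairs, as the example pattern suggests. The function name archive_key is hypothetical and not part of the OpenAleph codebase.

```python
import hashlib


def archive_key(content: bytes) -> str:
    """Build an archive key of the form aa/bb/cc/<sha1 hex digest>/data.

    Assumption: the prefix segments are the first three character pairs
    of the hex digest, matching the pattern shown in this section.
    """
    digest = hashlib.sha1(content).hexdigest()
    return "/".join([digest[0:2], digest[2:4], digest[4:6], digest, "data"])


# Identical content always maps to the same key.
print(archive_key(b"hello"))
```

The short prefix segments keep any single directory or key prefix from accumulating too many entries on filesystem and blob storage backends.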

Database(s) – PostgreSQL

OpenAleph uses PostgreSQL (only PostgreSQL or compatible engines are supported) for three purposes. These can all live in one database or, for high-performance deployments, in three separate database deployments. In the latter case, each database server can be configured and resource-optimized for its specific usage pattern.

If you plan to use only one database for all purposes (suitable for small to mid-sized OpenAleph instances), just use OPENALEPH_DB_URI as the configuration variable.

The minimum PostgreSQL version is 13; we recommend the latest (currently 17).

Application data

Stores users, groups, permissions, and collection metadata. This is usually not a big database, except for the so-called documents table, which stores metadata about all documents ever uploaded to OpenAleph. This table can grow over time and become slow when iterating over or sorting the entire table.

Setting: OPENALEPH_DB_URI

Entities data

This database stores all entity data in FollowTheMoney JSON format. It has one table per collection (dataset). It is the source of truth for Elasticsearch: all (re-)indexing reads from this store. This database can grow to many terabytes depending on your data.

Setting: FTM_FRAGMENTS_URI

Task queue data

This database holds the job data for the worker queues. Expect heavy reads and writes when running many workers. This database can become the bottleneck in large-scale processing deployments and benefits from more resources and PostgreSQL-specific optimizations.

The underlying task queue framework is procrastinate.

Setting: PROCRASTINATE_DB_URI
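The relationship between the three settings can be sketched as follows. This is an illustration only: it assumes that FTM_FRAGMENTS_URI and PROCRASTINATE_DB_URI fall back to OPENALEPH_DB_URI when unset, which is one reading of the single-database setup described above and not a confirmed description of how OpenAleph resolves them. The function name and the connection strings are hypothetical.

```python
import os
from typing import Mapping


def resolve_database_uris(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Return the connection URI for each of the three database roles.

    Assumption: a role without its own setting shares the application
    database configured in OPENALEPH_DB_URI.
    """
    default = env["OPENALEPH_DB_URI"]
    return {
        "application": default,
        "entities": env.get("FTM_FRAGMENTS_URI", default),
        "task_queue": env.get("PROCRASTINATE_DB_URI", default),
    }


# Single database for everything (small to mid-sized instances).
print(resolve_database_uris({"OPENALEPH_DB_URI": "postgresql://db/openaleph"}))

# Dedicated servers per role (hypothetical host names).
print(
    resolve_database_uris(
        {
            "OPENALEPH_DB_URI": "postgresql://app-db/openaleph",
            "FTM_FRAGMENTS_URI": "postgresql://ftm-db/fragments",
            "PROCRASTINATE_DB_URI": "postgresql://queue-db/procrastinate",
        }
    )
)
```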

Search Index - Elasticsearch

OpenAleph uses Elasticsearch to provide keyword and full-text search. See the openaleph-search technical documentation. Operating an Elasticsearch cluster is out of scope for this documentation, but many tweaks and optimizations can be helpful or even necessary depending on the nature of your source data and usage patterns. Because the database holds all the entity data (see above), the complete index can always be re-created from the database.

The minimum Elasticsearch version is 9. OpenAleph uses the ICU Analysis plugin for full-text processing. Refer to the documentation for how to install it. There is a pre-built Docker image with the plugin available at ghcr.io/openaleph/elasticsearch

Setting: OPENALEPH_ELASTICSEARCH_URI

The setting can point either to a single Elasticsearch node or to a JSON-formatted list of multiple nodes that will be used round-robin.
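The two accepted shapes of this setting can be illustrated with a small parser. This is a sketch of how such a value could be interpreted, not the code OpenAleph uses; the function name and the node URLs are hypothetical.

```python
import itertools
import json


def parse_elasticsearch_nodes(value: str) -> list[str]:
    """Return node URLs from either a single URL or a JSON-formatted list."""
    value = value.strip()
    if value.startswith("["):
        nodes = json.loads(value)
        if not isinstance(nodes, list) or not all(isinstance(n, str) for n in nodes):
            raise ValueError("expected a JSON list of node URLs")
        return nodes
    return [value]


# A single node.
print(parse_elasticsearch_nodes("http://localhost:9200"))

# Several nodes, visited in round-robin order.
nodes = parse_elasticsearch_nodes('["http://es1:9200", "http://es2:9200"]')
rotation = itertools.cycle(nodes)
print([next(rotation) for _ in range(4)])
```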

Cache - Redis

OpenAleph uses a cache layer that doesn't need to be persistent. The application expects the Redis API.

Setting: REDIS_URL

Ingest File Worker

Worker service to ingest and process source files. The more replicas are deployed, the faster OpenAleph can ingest files.

Image: ghcr.io/openaleph/ingest-file

Run command: procrastinate worker -q ingest

Documentation

Dependencies:

Archive
FollowTheMoney Database
Task queue Database
Redis (though it can use its own instance, as no cache is shared with other parts of the stack)

Analyze Worker

Worker service to analyze ingested documents (NER tagging and other extractions).

Image: ghcr.io/openaleph/ftm-analyze

Run command: procrastinate worker -q analyze

Documentation

Dependencies:

FollowTheMoney Database
Task queue Database

Application Worker

Worker service to process application-related tasks, including triggering ingest and analyze tasks as well as maintenance tasks. User-triggered actions like (re-)indexing or updating entities are handled by these workers, too.

Image: ghcr.io/openaleph/openaleph

Run command: procrastinate worker -q openaleph

Dependencies:

Application Database
FollowTheMoney Database
Task queue Database
Redis cache
Elasticsearch

Considerations

For small deployments, 2-3 workers might be sufficient; deploy more for large-scale deployments. It can be a good idea to split different tasks across different worker services, for instance dedicated, responsive workers for user-triggered tasks and background workers for long-running tasks. Refer to the openaleph-procrastinate documentation for how to configure different queues for different tasks so that workers can be deployed listening to specific queues.

API Service

Image: ghcr.io/openaleph/openaleph

Run command: gunicorn --config /aleph/gunicorn.conf.py --workers 6

Dependencies:

Application Database
FollowTheMoney Database
Task queue Database
Redis cache
Elasticsearch
Archive

Exposes the Flask-powered Python API that the UI talks to. Scale as needed.
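When planning maintenance on a backend, it can help to see which services depend on it. The sketch below encodes the dependency lists given in this section for the three workers and the API service; the short identifiers are hypothetical labels chosen for this example, not names used by OpenAleph.

```python
# Dependencies as listed in this section; identifiers are illustrative labels.
DEPENDENCIES = {
    "ingest-worker": {"archive", "ftm-db", "task-queue-db", "redis"},
    "analyze-worker": {"ftm-db", "task-queue-db"},
    "application-worker": {"app-db", "ftm-db", "task-queue-db", "redis", "elasticsearch"},
    "api": {"app-db", "ftm-db", "task-queue-db", "redis", "elasticsearch", "archive"},
}


def services_depending_on(backend: str) -> list[str]:
    """Return the services that list the given backend as a dependency."""
    return sorted(name for name, deps in DEPENDENCIES.items() if backend in deps)


for backend in ("elasticsearch", "archive", "task-queue-db"):
    print(backend, "->", services_depending_on(backend))
```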

UI (Frontend)

This service only serves the static assets and the React Router app. All other requests are passed through to the API service.

Image: ghcr.io/openaleph/aleph-ui

Run command: nginx

Dependencies: None, or the API service if using the default pass-through.

Recommendation: By default, a reverse proxy forwards requests only to this service. You can also expose the API service directly to the reverse proxy so that it handles all /api/... path requests itself.
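The routing decision in the second setup can be expressed as a small function. This is only a model of the rule, written in Python for illustration; in practice it would live in the reverse proxy configuration. The upstream names "api" and "ui" are hypothetical.

```python
def upstream_for(path: str) -> str:
    """Pick the upstream for a request path.

    Paths under /api/ go straight to the API service; everything else
    is served by the UI service.
    """
    if path == "/api" or path.startswith("/api/"):
        return "api"
    return "ui"


for path in ("/", "/search", "/api/2/collections", "/apiary"):
    print(path, "->", upstream_for(path))
```

Matching on the "/api/" prefix with its trailing slash avoids sending unrelated paths that merely start with the same letters to the API service.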