
Analyze Pipeline

Detected languages

ftm-analyze uses the fastText text classification library with a pre-trained model to detect the language of the document if it is not specified explicitly.
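The fallback described above can be sketched as follows. This is a minimal illustration, not the code ftm-analyze ships: `detect_language` and the `min_confidence` threshold are hypothetical names. The `predict` argument is assumed to behave like the fastText Python call `model.predict(text, k=1)`, which returns a tuple of labels prefixed with `__label__` and a matching sequence of probabilities.

```python
from typing import Callable, Optional, Sequence, Tuple

# Shape of a fastText-style prediction: (labels, probabilities).
Prediction = Tuple[Sequence[str], Sequence[float]]

LABEL_PREFIX = "__label__"


def detect_language(
    text: str,
    predict: Callable[[str], Prediction],
    explicit: Optional[str] = None,
    min_confidence: float = 0.5,  # hypothetical threshold, not taken from ftm-analyze
) -> Optional[str]:
    """Return the explicit language if one was given, else the model's best guess."""
    if explicit:
        # A language specified by the user always wins over detection.
        return explicit
    # fastText classifies a single line at a time, so newlines are collapsed first.
    labels, scores = predict(text.replace("\n", " "))
    if not labels or scores[0] < min_confidence:
        return None
    label = labels[0]
    # Strip the "__label__" prefix that fastText puts in front of each class.
    return label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
```

With a real model, `predict` could be something like `lambda t: model.predict(t, k=1)` after loading a language-identification model with `fasttext.load_model`. Any mapping between the model's language codes and the codes stored on the entity is left out of this sketch.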

Named-entity recognition (NER)

ftm-analyze uses the spaCy natural-language processing (NLP) framework and a number of pre-trained models for different languages to extract names of people, organizations, and countries from the text previously extracted from the Word document.

If an extracted name matches a known entity, it is converted into an actual Entity (Person, Company, ...); otherwise it is stored as a Mention.
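A rough sketch of this decision, under several assumptions: the input is a list of `(name, label)` pairs such as those read from a spaCy document's `ent.text` and `ent.label_`, the `known` lookup is a plain dictionary standing in for whatever lookup ftm-analyze actually uses, and `names_to_entities` and `LABEL_TO_SCHEMA` are hypothetical names introduced for illustration.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed mapping from NER labels to FollowTheMoney schemata.
LABEL_TO_SCHEMA = {"PERSON": "Person", "ORG": "Organization"}


def names_to_entities(
    document_id: str,
    spans: Iterable[Tuple[str, str]],
    known: Dict[str, str],
) -> List[dict]:
    """Turn (name, label) pairs into Entity or Mention dictionaries.

    `known` maps a lower-cased name to the schema of an already known entity.
    """
    results: List[dict] = []
    for name, label in spans:
        detected = LABEL_TO_SCHEMA.get(label)
        if detected is None:
            continue  # labels without a mapping (e.g. countries) are skipped here
        key = name.strip().lower()
        if key in known:
            # Known name: emit an actual entity with the known schema.
            results.append({"schema": known[key], "properties": {"name": [name]}})
        else:
            # Unknown name: keep it as a Mention that points back at the document.
            results.append({
                "schema": "Mention",
                "properties": {
                    "name": [name],
                    "document": [document_id],
                    "detectedSchema": [detected],
                },
            })
    return results
```

For example, `names_to_entities("97e1f...", [("John Doe", "PERSON")], {})` yields a single Mention with `detectedSchema` set to `["Person"]`.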

Extract patterns

In addition to NLP techniques, ftm-analyze also uses simple regular expressions to extract phone numbers, IBAN bank account numbers, and email addresses from documents.

In the special case of a valid IBAN (checked via schwifty), an additional BankAccount entity is created.
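The pattern step might look roughly like the sketch below. The regular expressions are deliberately simple and are not the ones ftm-analyze uses. To keep the example dependency-free, the schwifty validation is replaced by a hand-written ISO 13616 mod-97 checksum, which is weaker than schwifty because it does not check country-specific lengths or formats. The function names and the `iban` property on the BankAccount dictionary are assumptions.

```python
import re
from typing import Dict, List

# Simplified patterns for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+\d[\d\s().-]{5,}\d")  # international format only
IBAN_RE = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")  # compact form, no spaces


def iban_checksum_ok(iban: str) -> bool:
    """Mod-97 check: move the first four characters to the end, map letters
    to numbers (A=10 ... Z=35) and require a remainder of 1."""
    rearranged = iban[4:] + iban[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def extract_patterns(text: str) -> Dict[str, List[str]]:
    """Collect email addresses, phone numbers and checksum-valid IBANs."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
        "ibans": [m for m in IBAN_RE.findall(text) if iban_checksum_ok(m)],
    }


def bank_accounts(ibans: List[str]) -> List[dict]:
    """Build one BankAccount entity dictionary per valid IBAN."""
    return [{"schema": "BankAccount", "properties": {"iban": [iban]}} for iban in ibans]
```

Candidates that match the IBAN pattern but fail the checksum are dropped, so no BankAccount entity is created for them.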

Write fragments

Info

Under the hood, ftm-analyze uses followthemoney-store to store entity data. followthemoney-store stores entity data as "fragments": each fragment holds a subset of an entity's properties.
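To illustrate the idea, the sketch below combines fragments that share an entity ID into one entity by taking the union of their property values. It is a simplified model, not the followthemoney-store implementation: `merge_fragments` is a hypothetical helper, rows are plain dictionaries with the columns `id`, `origin`, `fragment` and `data`, and the schema of the first fragment is kept as-is.

```python
from typing import Dict, Iterable


def merge_fragments(rows: Iterable[dict]) -> Dict[str, dict]:
    """Combine fragments that share an entity ID into one entity each."""
    merged: Dict[str, dict] = {}
    for row in rows:
        data = row["data"]
        # Simplification: the first fragment seen decides the schema.
        entity = merged.setdefault(
            row["id"],
            {"id": row["id"], "schema": data["schema"], "properties": {}},
        )
        for prop, values in data["properties"].items():
            bucket = entity["properties"].setdefault(prop, [])
            for value in values:
                if value not in bucket:  # keep values unique, preserve order
                    bucket.append(value)
    return merged
```

Because each processing stage writes its own fragment, a stage can be re-run and overwrite only its own contribution without touching properties written by other stages.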

Any extracted entities or patterns are then stored in a separate entity fragment. Assuming that the Word document uploaded mentions a person named "John Doe", the entity fragment written to the FollowTheMoney Store might look like this:

id: 97e1f...
origin: analyze
fragment: default
data:
{
  "schema": "Pages",
  "properties": {
    "peopleMentioned": ["John Doe"],
    "detectedLanguage": ["eng"]
  }
}

ftm-analyze also creates separate entities for mentions of people and organizations. While this creates some redundancy, it allows OpenAleph to take them into account during cross-referencing. For example, another entity fragment is written because "John Doe" was recognized as the name of a person:

id: 310a4...
origin: analyze
fragment: default
data:
{
  "schema": "Mention",
  "properties": {
    "name": ["John Doe"],
    "document": ["97e1f..."], // ID of the `Pages` entity
    "resolved": ["356aa..."],
    "detectedSchema": ["Person"]
  }
}

Dispatch index task

At the end of the analyze task, ftm-analyze dispatches an index task. This pushes a task object to the index queue for OpenAleph that includes a payload with the IDs of any entities written in the previous step.
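As a rough sketch of this hand-off, the example below pushes a task object carrying the written entity IDs onto a queue. Everything here is an assumption made for illustration: a standard-library `queue.Queue` stands in for the real task queue, and the function name and the `operation`, `dataset` and `entity_ids` fields are hypothetical, not the actual OpenAleph task format.

```python
import json
import queue
from typing import Iterable


def dispatch_index_task(
    index_queue: "queue.Queue[str]",
    entity_ids: Iterable[str],
    dataset: str,
) -> dict:
    """Serialize an index task and push it onto the given queue."""
    task = {
        "operation": "index",  # hypothetical field names throughout
        "dataset": dataset,
        # De-duplicate IDs so each entity is indexed once per task.
        "payload": {"entity_ids": sorted(set(entity_ids))},
    }
    index_queue.put(json.dumps(task))
    return task
```

Passing only IDs keeps the task small: the indexer can load the full entity data from the store when it processes the task.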


Thanks to Till Prochaska, who initially wrote up the pipeline for the original Aleph documentation.