## Analyze Pipeline
### Detected languages
ftm-analyze uses the fastText text classification library with a pre-trained model to detect the language of the document if it is not specified explicitly.
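To get a feel for this step, here is a minimal sketch of language detection with fastText's pre-trained language-identification model. The model file name and the wiring are illustrative assumptions; ftm-analyze's actual configuration may differ.

```python
# Minimal language detection sketch using fastText's pre-trained
# lid.176 model. Model path and wiring are illustrative assumptions,
# not ftm-analyze's actual configuration.
import fasttext

model = fasttext.load_model("lid.176.ftz")

def detect_language(text: str) -> str:
    # predict() returns labels such as "__label__en" plus a confidence score;
    # fastText expects a single line of text, so strip newlines first.
    labels, scores = model.predict(text.replace("\n", " "), k=1)
    return labels[0].removeprefix("__label__")

print(detect_language("Dieses Dokument ist auf Deutsch verfasst."))  # e.g. "de"
```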
### Named-entity recognition (NER)
ftm-analyze uses the spaCy natural-language processing (NLP) framework and a number of pre-trained models for different languages to extract the names of people, organizations, and countries from the text previously extracted from the Word document.
If the extracted names are known entities, they will be converted into an actual entity (`Person`, `Company`, ...); otherwise, they will be stored as a `Mention`.
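A minimal spaCy sketch of this step might look as follows. The model name and the label filter are illustrative assumptions; ftm-analyze selects a pre-trained model per detected language.

```python
# NER sketch with spaCy. The model name and label filter are
# illustrative assumptions; ftm-analyze picks a model per language.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("John Doe works for ACME Inc. and travels to Germany.")

for ent in doc.ents:
    # PERSON, ORG and GPE roughly correspond to people, organizations
    # and countries/locations
    if ent.label_ in ("PERSON", "ORG", "GPE"):
        print(ent.label_, ent.text)
```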
### Extract patterns
In addition to NLP techniques, ftm-analyze also uses simple regular expressions to extract phone numbers, IBAN bank account numbers, and email addresses from documents.
In the special case of a valid IBAN (checked via schwifty), an additional `BankAccount` entity is created.
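As a rough sketch, pattern extraction and IBAN validation could be combined like this. The regular expression below is simplified for illustration; the patterns actually used by ftm-analyze are more elaborate.

```python
# Simplified pattern extraction sketch: a basic IBAN-like regex plus
# validation with schwifty. Illustrative only, not the actual patterns
# used by ftm-analyze.
import re
from schwifty import IBAN

IBAN_RE = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")

def extract_ibans(text: str) -> list[str]:
    valid = []
    for match in IBAN_RE.findall(text):
        try:
            # schwifty raises an error for invalid country codes or checksums
            valid.append(str(IBAN(match)))
        except ValueError:
            continue
    return valid

print(extract_ibans("Please transfer to DE89370400440532013000 by Friday."))
```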
### Write fragments
> **Info:** Under the hood, ftm-analyze uses followthemoney-store to store entity data. followthemoney-store stores entity data as "fragments": every fragment stores a subset of the properties. Read more about fragments.
Any extracted entities or patterns are then stored in a separate entity fragment. Assuming that the uploaded Word document mentions a person named "John Doe", the entity fragment written to the FollowTheMoney Store might look like this:
| id | origin | fragment | data |
| --- | --- | --- | --- |
| 97e1f... | analyze | default | |
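Conceptually, this fragment adds the extracted values as properties of the document entity. The following sketch uses the followthemoney and followthemoney-store libraries; the schema, property names, and dataset name are assumptions for illustration, not verbatim ftm-analyze output.

```python
# Illustrative sketch only: schema, property names and dataset name
# are assumptions, not verbatim ftm-analyze output.
from followthemoney import model
from ftmstore import get_dataset

dataset = get_dataset("my_collection")

fragment = model.make_entity("Pages")
fragment.id = "97e1f..."  # same id as the document entity
fragment.add("peopleMentioned", "John Doe")
fragment.add("detectedLanguage", "eng")

# Write the fragment with its own origin so it can coexist with the
# fragments written during ingestion.
bulk = dataset.bulk()
bulk.put(fragment, fragment="default", origin="analyze")
bulk.flush()
```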
Additionally, ftm-analyze creates separate entities for mentions of people and organizations. While this creates some redundancy, it allows OpenAleph to take them into account during cross-referencing. For example, another entity fragment is written because "John Doe" was recognized as a person's name:
| id | origin | fragment | data |
| --- | --- | --- | --- |
| 310a4... | analyze | default | |
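A hedged sketch of such a mention entity, using the FollowTheMoney `Mention` schema; the property names and placeholder IDs are illustrative.

```python
# Illustrative sketch of a Mention entity pointing back to the source
# document. Property names follow the FollowTheMoney Mention schema;
# the ids are placeholders taken from the example above.
from followthemoney import model

mention = model.make_entity("Mention")
mention.id = "310a4..."
mention.add("name", "John Doe")
mention.add("detectedSchema", "Person")  # the schema guessed during NER
mention.add("document", "97e1f...")      # reference to the document entity

print(mention.to_dict())
```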
### Dispatch index task
At the end of the `analyze` task, ftm-analyze dispatches an `index` task. This pushes a task object to the index queue for OpenAleph that includes a payload with the IDs of any entities written in the previous step.
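The hand-off can be pictured roughly like this. `dispatch_task` is a hypothetical stand-in for OpenAleph's actual queueing mechanism, and the payload keys are assumptions.

```python
# Hypothetical sketch of the hand-off to the index stage. dispatch_task
# and the payload shape are illustrative assumptions, not the real
# OpenAleph task API.
def dispatch_task(queue: str, payload: dict) -> None:
    print(f"pushed to '{queue}': {payload}")

# ids of the entity fragments written in the previous step
entity_ids = ["97e1f...", "310a4..."]
dispatch_task("index", {"entity_ids": entity_ids})
```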
Thanks to Till Prochaska, who initially wrote up the pipeline for the original Aleph documentation.