Named Entity Recognition
Info
Entity extraction builds on how ingest-file originally extracted mentioned entities.
Originally, ingest-file filtered the entities returned by spaCy with a custom schema prediction model trained on existing FollowTheMoney data. Based on that prediction, Mention entities are created. These mentions are resolved into actual entities (e.g. Company, Person) when datasets are cross-referenced.
This creates a problem for "smaller" OpenAleph instances: if there is not enough data to cross-reference against, these Mention entities are never resolved. The same applies when the analysis is used standalone.
ftm-analyze addresses this problem: extracted names can be compared against juditha, and if they are known, the resolved entities are returned instead of mentions.
juditha allows a fast lookup (based on tantivy) against a set of known names (from FollowTheMoney data). The index can be populated by reference datasets such as company registries, sanctions lists, or PEPs.
Set up juditha
Configure the juditha store URI:
export JUDITHA_URI=/path/to/store.db
For example, to load all PEPs by OpenSanctions:
juditha load-dataset -i https://data.opensanctions.org/datasets/latest/peps/index.json
juditha build
When ftm-analyze is used now, it turns known person names into actual Person entities (instead of mentions) if they appear in this PEPs list (fuzzy matching included).
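The setup steps above can be collected into a small script that fails early when the store is not configured. This is only a sketch: JUDITHA_URI and the two juditha commands are taken from the steps above, while the function names require_juditha_uri and populate_peps_index are hypothetical helpers introduced for illustration.

```shell
#!/usr/bin/env bash
# Sketch: populate a juditha index with the OpenSanctions PEPs dataset.
# Only JUDITHA_URI and the two juditha commands come from the steps above;
# the function names below are hypothetical helpers.

PEPS_INDEX="https://data.opensanctions.org/datasets/latest/peps/index.json"

# Fail early with a readable message if the store URI is not configured.
require_juditha_uri() {
    if [ -z "${JUDITHA_URI:-}" ]; then
        echo "JUDITHA_URI is not set" >&2
        return 1
    fi
    echo "Using juditha store at ${JUDITHA_URI}"
}

# Load the reference dataset, then build the index; stop at the first failure.
populate_peps_index() {
    require_juditha_uri || return 1
    juditha load-dataset -i "${PEPS_INDEX}" || return 1
    juditha build
}
```

After sourcing the script, export JUDITHA_URI as shown earlier and call populate_peps_index. Keeping the check in its own function means a missing configuration is reported before any data is downloaded.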