ingest-file
ingest-file extracts useful information from documents of different types into a structured, standard format. It retains folder structures across directories, compressed archives and emails. The extracted data is formatted as Follow the Money (FtM) entities, ready for import into OpenAleph or for processing as an object graph.
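To make the idea of retained folder structure more concrete, here is a minimal sketch of how a nested path could be expressed as FtM-shaped entities linked through a `parent` property. It uses only the Python standard library and plain dictionaries; it does not call ingest-file or the followthemoney library. The helpers `make_id` and `path_to_entities` and the hash-based ID scheme are assumptions made for illustration, not the identifiers ingest-file actually generates.

```python
import hashlib
import json
from pathlib import PurePosixPath


def make_id(path):
    # Hypothetical ID scheme for this sketch only: a hash of the full path.
    return hashlib.sha1(path.encode("utf-8")).hexdigest()


def path_to_entities(path):
    """Return one FtM-shaped dict per path component, outermost folder first."""
    entities = []
    parent_id = None
    parts = PurePosixPath(path).parts
    for depth, name in enumerate(parts, start=1):
        current = "/".join(parts[:depth])
        is_leaf = depth == len(parts)
        properties = {"fileName": [name]}
        if parent_id is not None:
            # Each entity points at its containing folder.
            properties["parent"] = [parent_id]
        entity = {
            "id": make_id(current),
            "schema": "Document" if is_leaf else "Folder",
            "properties": properties,
        }
        entities.append(entity)
        parent_id = entity["id"]
    return entities


entities = path_to_entities("reports/2023/summary.pdf")
print(json.dumps(entities, indent=2))
```

Following the `parent` links from the last entity back to the first reconstructs the original directory hierarchy, which is the sense in which the output can be treated as an object graph.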
Supported file types:
- Plain text
- Images
- Web pages, XML documents
- PDF files
- Emails (Outlook, plain text)
- Archive files (ZIP, RAR, etc.)
- Audio and video (text extraction via ftm-transcribe)
Other features:
- Extendable and composable using classes and mixins.
- Writes the extracted FollowTheMoney entities to a database as result objects.
- Queue support for distributed processing, based on procrastinate.
- Thoroughly tested.
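As a rough illustration of what composing behaviour from classes and mixins can look like, the sketch below combines a base class with a reusable decoding mixin. All names here (`Ingestor`, `EncodingMixin`, `PlainTextIngestor`, `MIME_TYPES`) are hypothetical and chosen for this example; they are not taken from the ingest-file codebase, whose actual class hierarchy may differ.

```python
class Ingestor:
    """Hypothetical base class: turns raw bytes into an FtM-shaped dict."""

    MIME_TYPES = ()

    def ingest(self, data):
        raise NotImplementedError


class EncodingMixin:
    """Hypothetical mixin: a decoding helper shared by text-like ingestors."""

    def decode(self, data):
        # Try UTF-8 first, then fall back to Latin-1, which never fails.
        try:
            return data.decode("utf-8")
        except UnicodeDecodeError:
            return data.decode("latin-1")


class PlainTextIngestor(EncodingMixin, Ingestor):
    """Combines the base interface with the shared decoding behaviour."""

    MIME_TYPES = ("text/plain",)

    def ingest(self, data):
        return {
            "schema": "PlainText",
            "properties": {"bodyText": [self.decode(data)]},
        }


result = PlainTextIngestor().ingest("café".encode("latin-1"))
print(result["properties"]["bodyText"][0])
```

A new file type would then need only a new subclass that picks the mixins it requires, leaving the existing ingestors unchanged.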
Usage
License
As of release version 3.18.4, ingest-file is licensed under the AGPLv3 or later. Previous versions were released under the MIT license.