ingest-file

ingest-file extracts useful information from documents of different types into a structured, standard format. It retains folder structures across directories, compressed archives, and emails. The extracted data is formatted as Follow the Money (FtM) entities, ready for import into OpenAleph or for processing as an object graph.
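
To illustrate how a folder structure can be kept in entity form, here is a minimal sketch in plain Python. It is not ingest-file's own code: the `make_id` and `walk_to_entities` helpers and the path-hash ID scheme are assumptions made for this example. It only relies on the FtM `Folder` and `Document` schemata and their `fileName` and `parent` properties, and it emits plain dictionaries in the FtM JSON shape instead of using the followthemoney library.

```python
import hashlib
from pathlib import Path
from typing import Iterator, Optional


def make_id(path: Path) -> str:
    # Hypothetical ID scheme for this sketch: a hash of the path.
    # ingest-file may derive entity IDs differently.
    return hashlib.sha1(path.as_posix().encode("utf-8")).hexdigest()


def walk_to_entities(root: Path, parent_id: Optional[str] = None) -> Iterator[dict]:
    """Yield FtM-style entity dicts for `root` and everything below it."""
    entity_id = make_id(root)
    properties = {"fileName": [root.name]}
    if parent_id is not None:
        # The `parent` property links a child back to its containing folder.
        properties["parent"] = [parent_id]
    schema = "Folder" if root.is_dir() else "Document"
    yield {"id": entity_id, "schema": schema, "properties": properties}
    if root.is_dir():
        # Sort children so the output order is stable across runs.
        for child in sorted(root.iterdir()):
            yield from walk_to_entities(child, parent_id=entity_id)
```

Each yielded dictionary points at its parent by ID, so the original hierarchy can be rebuilt from the flat stream of entities.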

Supported file types:

  • Plain text
  • Images
  • Web pages, XML documents
  • PDF files
  • Emails (Outlook, plain text)
  • Archive files (ZIP, RAR, etc.)
  • Audio and video (text extraction via ftm-transcribe)

See all MIME types

Other features:

  • Extendable and composable using classes and mixins.
  • Writes the resulting Follow the Money entities to a database.
  • Queue support for distributed processing, based on procrastinate.
  • Thoroughly tested.
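
The "classes and mixins" approach can be pictured with the following minimal sketch. All names here (`Ingestor`, `EncodingSupport`, `PlainTextIngestor`, `INGESTORS`, `ingest`) are hypothetical and chosen for illustration; they are not taken from ingest-file's API. The sketch shows the general pattern only: a base class defines the interface, a mixin contributes one reusable capability, and a concrete ingestor composes the two.

```python
from typing import Optional


class Ingestor:
    """Hypothetical base class: matches a MIME type and extracts data."""

    MIME_TYPES: tuple = ()

    @classmethod
    def match(cls, mime_type: str) -> bool:
        return mime_type in cls.MIME_TYPES

    def ingest(self, data: bytes) -> dict:
        raise NotImplementedError


class EncodingSupport:
    """Hypothetical mixin: decode bytes as UTF-8, falling back to Latin-1."""

    def decode(self, data: bytes) -> str:
        try:
            return data.decode("utf-8")
        except UnicodeDecodeError:
            return data.decode("latin-1")


class PlainTextIngestor(EncodingSupport, Ingestor):
    """Composes the base interface with the decoding mixin."""

    MIME_TYPES = ("text/plain",)

    def ingest(self, data: bytes) -> dict:
        # Emit an FtM-style dict using the PlainText schema and bodyText property.
        return {"schema": "PlainText", "properties": {"bodyText": [self.decode(data)]}}


# Registry of available ingestors; supporting a new file type means adding a class.
INGESTORS = [PlainTextIngestor]


def ingest(mime_type: str, data: bytes) -> Optional[dict]:
    """Dispatch to the first ingestor that matches, or return None."""
    for cls in INGESTORS:
        if cls.match(mime_type):
            return cls().ingest(data)
    return None
```

Under this pattern, a shared capability such as decoding lives in one mixin and can be reused by any ingestor that needs it.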

Usage

License

As of release 3.18.4, ingest-file is licensed under the AGPLv3 or later. Previous versions were released under the MIT license.