Ingestors
SevenZipIngestor
ingestors.packages.SevenZipIngestor
File types
- application/x-7z-compressed
- application/7z-compressed
File extensions
- .7z
- .7zip
Bases: PackageSupport, Ingestor, ShellSupport
Source code in ingestors/packages/__init__.py
AccessIngestor
ingestors.tabular.access.AccessIngestor
File types
- application/msaccess
- application/x-msaccess
- application/vnd.msaccess
- application/vnd.ms-access
- application/mdb
- application/x-mdb
File extensions
- .mdb
Bases: Ingestor, TableSupport, ShellSupport
Source code in ingestors/tabular/access.py
AudioIngestor
ingestors.media.audio.AudioIngestor
File types
- audio/mpeg
- audio/mp3
- audio/x-m4a
- audio/x-hx-aac-adts
- audio/x-wav
- audio/mp4
- audio/ogg
- audio/vnd.wav
- audio/flac
- audio/x-ms-wma
- audio/webm
File extensions
- .wav
- .mp3
- .aac
- .ac3
- .m4a
- .m4b
- .ogg
- .opus
- .flac
- .wma
Bases: Ingestor, TimestampSupport, TranscriptionSupport
Source code in ingestors/media/audio.py
BZ2Ingestor
ingestors.packages.BZ2Ingestor
File types
- application/x-bzip
- application/x-bzip2
- multipart/x-bzip
- multipart/x-bzip2
File extensions
- .bz
- .tbz
- .bz2
- .tbz2
Bases: SingleFilePackageIngestor
Source code in ingestors/packages/__init__.py
CalendarIngestor
ingestors.email.calendar.CalendarIngestor
File types
- text/calendar
File extensions
- .ics
- .ical
- .icalendar
- .ifb
Bases: Ingestor, EncodingSupport
Source code in ingestors/email/calendar.py
CSVIngestor
ingestors.tabular.csv.CSVIngestor
Decode and ingest a CSV file.
This expects a properly formatted CSV file with a header in the first row.
File types
- text/csv
- text/tsv
- text/tab-separated-values
File extensions
- .csv
- .tsv
Bases: Ingestor, TableSupport
Source code in ingestors/tabular/csv.py
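The header-in-first-row expectation can be illustrated with the standard library alone. This is a minimal sketch, not the CSVIngestor implementation: `read_rows` and the sample data are hypothetical, and the real ingestor may also sniff the dialect and decode the file first.

```python
import csv
import io


def read_rows(text):
    """Yield one dict per data row, keyed by the header in the first row."""
    # DictReader consumes the first row and uses it as the field names.
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        yield dict(row)


# Hypothetical sample: the first line is the header, the rest are records.
sample = "name,country\nAlice,DE\nBob,FR\n"
rows = list(read_rows(sample))
print(rows)
```

A file without a header row would have its first record treated as column names, which is why the header is required.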
DBFIngestor
ingestors.tabular.dbf.DBFIngestor
File types
- application/dbase
- application/x-dbase
- application/dbf
- application/x-dbf
File extensions
- .dbf
Bases: Ingestor, TableSupport
Source code in ingestors/tabular/dbf.py
DjVuIngestor
ingestors.documents.djvu.DjVuIngestor
Read DjVu e-books.
File types
- image/vnd.djvu
- image/x.djvu
- image/x-djvu
- image/djvu
File extensions
Bases: Ingestor, PDFSupport, TempFileSupport
Source code in ingestors/documents/djvu.py
ingest(file_path, entity)
Ingestor implementation.
Source code in ingestors/documents/djvu.py
AppleEmlxIngestor
ingestors.email.emlx.AppleEmlxIngestor
File types
File extensions
- .emlx
Bases: RFC822Ingestor
Source code in ingestors/email/emlx.py
GzipIngestor
ingestors.packages.GzipIngestor
File types
- application/gzip
- application/x-gzip
- multipart/x-gzip
File extensions
- .gz
- .tgz
Bases: SingleFilePackageIngestor
Source code in ingestors/packages/__init__.py
HTMLIngestor
ingestors.documents.html.HTMLIngestor
HTML file ingestor class. Extracts the text from the web page.
File types
- text/html
File extensions
- .htm
- .html
- .xhtml
Bases: Ingestor, EncodingSupport, HTMLSupport
Source code in ingestors/documents/html.py
ingest(file_path, entity)
IgnoreIngestor
ingestors.ignore.IgnoreIngestor
File types
- application/x-pkcs7-mime
- application/pkcs7-mime
- application/pkcs7-signature
- application/x-pkcs7-signature
- application/x-pkcs12
- application/pgp-encrypted
- application/x-shockwave-flash
- application/vnd.apple.pkpass
- application/x-executable
- application/x-mach-binary
- application/x-sharedlib
- application/x-dosexec
- application/x-java-keystore
- application/java-archive
- application/font-sfnt
- application/vnd.ms-office.vbaproject
- application/x-x509-ca-cert
- text/calendar
- text/css
- application/vnd.ms-opentype
- application/x-font-ttf
File extensions
- .json
- .exe
- .dll
- .ini
- .class
- .jar
- .psd
- .indd
- .sql
- .dat
- .log
- .pbl
- .p7m
- .plist
- .ics
- .axd
Bases: Ingestor
Source code in ingestors/ignore.py
ImageIngestor
ingestors.media.image.ImageIngestor
Image file ingestor class. Extracts the text from images using OCR.
File types
- image/x-portable-graymap
- image/png
- image/x-png
- image/jpeg
- image/jpg
- image/gif
- image/pjpeg
- image/bmp
- image/x-windows-bmp
- image/x-portable-bitmap
- image/x-coreldraw
- application/postscript
- image/vnd.dxf
File extensions
- .jpg
- .jpe
- .jpeg
- .png
- .gif
- .bmp
Bases: Ingestor, OCRSupport, TimestampSupport
Source code in ingestors/media/image.py
JSONIngestor
ingestors.misc.jsonfile.JSONIngestor
File types
- application/json
- text/javascript
File extensions
- .json
Bases: Ingestor, EncodingSupport
Source code in ingestors/misc/jsonfile.py
MboxFileIngestor
ingestors.email.mbox.MboxFileIngestor
File types
- application/mbox
File extensions
- .mbox
Bases: RFC822Ingestor, TempFileSupport
Source code in ingestors/email/mbox.py
RFC822Ingestor
ingestors.email.msg.RFC822Ingestor
File types
- multipart/mixed
- message/rfc822
File extensions
- .eml
- .rfc822
- .email
- .msg
Bases: Ingestor, EmailSupport, EncodingSupport
Source code in ingestors/email/msg.py
OpenOfficeSpreadsheetIngestor
ingestors.tabular.ods.OpenOfficeSpreadsheetIngestor
File types
- application/vnd.oasis.opendocument.spreadsheet
- application/vnd.oasis.opendocument.spreadsheet-template
File extensions
- .ods
- .ots
Bases: Ingestor, TableSupport, OpenDocumentSupport
Source code in ingestors/tabular/ods.py
DocumentIngestor
ingestors.documents.office.DocumentIngestor
Office/Word document ingestor class.
Converts the document to PDF and extracts the text. Mostly a slightly adjusted PDF ingestor.
Requires system tools:
- OpenOffice/LibreOffice with dependencies
- image ingestor dependencies, to cover OCR of any embedded images
File types
- text/richtext
- text/rtf
- application/rtf
- application/x-rtf
- application/msword
- application/vnd.ms-word
- application/wordperfect
- application/vnd.wordperfect
- application/vnd.ms-powerpoint
- application/vnd.sun.xml.impress
- application/vnd.ms-powerpoint.presentation
- application/vnd.ms-powerpoint.presentation.12
- application/CDFV2-unknown
- application/CDFV2-corrupt
- application/clarisworks
- application/epub+zip
- application/macwriteii
- application/msword
- application/prs.plucker
- application/vnd.corel-draw
- application/vnd.lotus-wordpro
- application/vnd.ms-powerpoint
- application/vnd.ms-powerpoint.presentation.macroEnabled.main+xml
- application/vnd.ms-works
- application/vnd.palm
- application/vnd.sun.xml.draw
- application/vnd.sun.xml.draw.template
- application/vnd.sun.xml.impress
- application/vnd.sun.xml.impress.template
- application/vnd.sun.xml.writer
- application/vnd.sun.xml.writer.global
- application/vnd.sun.xml.writer.template
- application/vnd.sun.xml.writer.web
- application/vnd.visio
- application/vnd.wordperfect
- application/x-abiword
- application/x-aportisdoc
- application/x-fictionbook+xml
- application/x-hwp
- application/x-iwork-keynote-sffkey
- application/x-iwork-pages-sffpages
- application/x-mspublisher
- application/x-mswrite
- application/x-pagemaker
- application/x-sony-bbeb
- application/x-t602
- image/x-cmx
- image/x-freehand
- image/x-wpg
File extensions
- .602
- .abw
- .cdr
- .cmx
- .cwk
- .doc
- .dot
- .dps
- .dpt
- .epub
- .fb2
- .fh
- .fh1
- .fh10
- .fh11
- .fh2
- .fh3
- .fh4
- .fh5
- .fh6
- .fh7
- .fh8
- .fh9
- .fodg
- .fodp
- .fodt
- .hwp
- .key
- .lrf
- .lwp
- .mcw
- .mw
- .mwd
- .nxd
- .odg
- .odm
- .otg
- .oth
- .otm
- .otp
- .ott
- .p65
- .pages
- .pdb
- .pm
- .pm6
- .pmd
- .pot
- .pps
- .ppt
- .pub
- .qxd
- .qxt
- .rtf
- .sda
- .sdd
- .sdw
- .std
- .sti
- .stw
- .sxd
- .sxg
- .sxi
- .sxw
- .vdx
- .vsd
- .vsdm
- .vsdx
- .wn
- .wpd
- .wpg
- .wps
- .wpt
- .wri
- .xlc
- .xlm
- .xls
- .xlw
- .zabw
- .zmf
Bases: Ingestor, OLESupport, PDFSupport
Source code in ingestors/documents/office.py
ingest(file_path, entity)
Ingestor implementation.
Source code in ingestors/documents/office.py
OutlookMsgIngestor
ingestors.email.outlookmsg.OutlookMsgIngestor
File types
- application/msg
- application/x-msg
- application/vnd.ms-outlook
- msg/rfc822
File extensions
- .msg
Bases: Ingestor, EmailSupport, OLESupport, TempFileSupport
Source code in ingestors/email/outlookmsg.py
OutlookOLMArchiveIngestor
ingestors.email.olm.OutlookOLMArchiveIngestor
File types
File extensions
- .olm
Bases: Ingestor, TempFileSupport, XMLSupport
Source code in ingestors/email/olm.py
extract_attachment(zipf, message, attachment)
Create an entity for an attachment, assign its parent, and put it on the task queue for processing.
Source code in ingestors/email/olm.py
extract_file(zipf, name)
Extract a message file from the OLM zip archive
Source code in ingestors/email/olm.py
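Reading a single member out of a zip archive can be sketched with the standard `zipfile` module. This is an illustration only: `read_member` is a hypothetical helper, and the real `extract_file` may write the member to a temporary file rather than return its bytes.

```python
import io
import zipfile


def read_member(zipf, name):
    """Return the bytes of one archive member, or None if it is missing."""
    try:
        with zipf.open(name) as fh:
            return fh.read()
    except KeyError:
        # ZipFile.open raises KeyError for names that are not in the archive.
        return None


# Build a tiny in-memory archive so the example is self-contained.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("Accounts/message_0001.xml", "<email/>")

with zipfile.ZipFile(buffer) as zf:
    found = read_member(zf, "Accounts/message_0001.xml")
    missing = read_member(zf, "does-not-exist.xml")

print(found, missing)
```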
extract_hierarchy(entity, name)
Given a file path, create all its ancestor folders as entities
Source code in ingestors/email/olm.py
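The idea of walking a path's ancestors can be sketched as follows. The helper `ancestor_folders` and the sample path are hypothetical; the sketch only computes the folder paths, outermost first, and does not create entities as `extract_hierarchy` does.

```python
from pathlib import PurePosixPath


def ancestor_folders(name):
    """Return the ancestor folders of a relative archive path, outermost first."""
    # `parents` is ordered innermost-first and ends with "." for relative paths.
    parents = list(PurePosixPath(name).parents)
    return [str(p) for p in reversed(parents) if str(p) != "."]


folders = ancestor_folders("Accounts/me/Inbox/message_0001.xml")
print(folders)
```

Processing the folders outermost first means each folder's parent already exists by the time the folder itself is handled.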
OfficeOpenXMLIngestor
ingestors.documents.ooxml.OfficeOpenXMLIngestor
Office/Word document ingestor class.
Converts the document to PDF and extracts the text. Mostly a slightly adjusted PDF ingestor.
File types
- application/vnd.openxmlformats-officedocument.wordprocessingml.document
- application/vnd.openxmlformats-officedocument.wordprocessingml.template
- application/vnd.openxmlformats-officedocument.presentationml.slideshow
- application/vnd.openxmlformats-officedocument.presentationml.presentation
- application/vnd.openxmlformats-officedocument.presentationml.template
- application/vnd.openxmlformats-officedocument.presentationml.slideshow
File extensions
- .docx
- .docm
- .dotx
- .dotm
- .potx
- .pptx
- .ppsx
- .pptm
- .ppsm
- .potm
Bases: Ingestor, OOXMLSupport, PDFSupport
Source code in ingestors/documents/ooxml.py
ingest(file_path, entity)
Ingestor implementation.
Source code in ingestors/documents/ooxml.py
OpenDocumentIngestor
ingestors.documents.opendoc.OpenDocumentIngestor
Office/Word document ingestor class.
Converts the document to PDF and extracts the text. Mostly a slightly adjusted PDF ingestor.
Requires system tools:
- OpenOffice/LibreOffice with dependencies
- image ingestor dependencies, to cover OCR of any embedded images
File types
- application/vnd.oasis.opendocument.text
- application/vnd.oasis.opendocument.text-template
- application/vnd.oasis.opendocument.presentation
- application/vnd.oasis.opendocument.graphics
- application/vnd.oasis.opendocument.graphics-flat-xml
- application/vnd.oasis.opendocument.graphics-template
- application/vnd.oasis.opendocument.presentation-flat-xml
- application/vnd.oasis.opendocument.presentation-template
- application/vnd.oasis.opendocument.chart
- application/vnd.oasis.opendocument.chart-template
- application/vnd.oasis.opendocument.image
- application/vnd.oasis.opendocument.image-template
- application/vnd.oasis.opendocument.formula
- application/vnd.oasis.opendocument.formula-template
- application/vnd.oasis.opendocument.text-flat-xml
- application/vnd.oasis.opendocument.text-master
- application/vnd.oasis.opendocument.text-web
File extensions
- .odt
- .odp
- .otp
Bases: Ingestor, OpenDocumentSupport, PDFSupport
Source code in ingestors/documents/opendoc.py
ingest(file_path, entity)
Ingestor implementation.
Source code in ingestors/documents/opendoc.py
OutlookOLMMessageIngestor
ingestors.email.olm.OutlookOLMMessageIngestor
File types
- application/xml+opfmessage
File extensions
Bases: Ingestor, XMLSupport, EmailSupport, TimestampSupport
Source code in ingestors/email/olm.py
PDFIngestor
ingestors.documents.pdf.PDFIngestor
PDF file ingestor class.
Extracts the text from the document by converting it first to XML. Splits the file into pages.
File types
- application/pdf
File extensions
Bases: Ingestor, PDFSupport
Source code in ingestors/documents/pdf.py
ingest(file_path, entity)
Ingestor implementation.
Source code in ingestors/documents/pdf.py
PlainTextIngestor
ingestors.documents.plain.PlainTextIngestor
Plain text file ingestor class.
Extracts the text from the document and enforces Unicode on it.
File types
- text/plain
- text/x-c
- text/x-c++
- text/x-diff
- text/x-python
- text/x-shellscript
- text/x-java
- text/x-php
- text/troff
- text/x-ruby
- text/x-pascal
- text/x-msdos-batch
- text/x-yaml
- text/x-makefile
- text/x-perl
- text/x-objective-c
- text/x-msdos-batch
- text/x-asm
- text/x-csrc
- text/x-sh
- text/javascript
- text/x-algol68
File extensions
- .txt
- .md
- .rst
- .nfo
Bases: Ingestor, EncodingSupport
Source code in ingestors/documents/plain.py
ingest(file_path, entity)
Ingestor implementation.
Source code in ingestors/documents/plain.py
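One way to read "enforces Unicode" is that raw bytes are always turned into a `str`, whatever their encoding. The sketch below assumes a UTF-8 first attempt with a Latin-1 fallback; the helper name `to_unicode` and the choice of encodings are assumptions for illustration, and the actual behaviour comes from EncodingSupport and may differ (for example, it may detect the encoding instead).

```python
def to_unicode(data, encoding="utf-8", fallback="latin-1"):
    """Decode bytes to str, falling back to a single-byte encoding on failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this decode cannot fail.
        return data.decode(fallback)


utf8_text = to_unicode("café".encode("utf-8"))
latin1_text = to_unicode(b"caf\xe9")
print(utf8_text, latin1_text)
```

The fallback trades accuracy for robustness: text in some other legacy encoding would still decode, but possibly to the wrong characters.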
OutlookPSTIngestor
ingestors.email.outlookpst.OutlookPSTIngestor
File types
- application/vnd.ms-outlook
File extensions
- .pst
- .ost
- .pab
Bases: Ingestor, TempFileSupport, OLESupport, ShellSupport
Source code in ingestors/email/outlookpst.py
RARIngestor
ingestors.packages.rar.RARIngestor
File types
- application/rar
- application/x-rar
File extensions
- .rar
Bases: PackageSupport, Ingestor
Source code in ingestors/packages/rar.py
SQLiteIngestor
ingestors.tabular.sqlite.SQLiteIngestor
File types
- application/x-sqlite3
- application/x-sqlite
- application/sqlite3
- application/sqlite
File extensions
- .sqlite3
- .sqlite
- .db
Bases: Ingestor, TableSupport
Source code in ingestors/tabular/sqlite.py
SVGIngestor
ingestors.media.svg.SVGIngestor
File types
- image/svg+xml
File extensions
- .svg
Bases: Ingestor, EncodingSupport, HTMLSupport
Source code in ingestors/media/svg.py
TarIngestor
ingestors.packages.tar.TarIngestor
File types
- application/tar
- application/x-tar
- application/x-tgz
- application/x-gtar
File extensions
- .tar
Bases: PackageSupport, Ingestor
Source code in ingestors/packages/tar.py
TIFFIngestor
ingestors.media.tiff.TIFFIngestor
TIFF appears to not really be an image format. Who knew?
File types
- image/tiff
- image/x-tiff
File extensions
- .tif
- .tiff
Bases: Ingestor, PDFSupport, TempFileSupport, ShellSupport
Source code in ingestors/media/tiff.py
VCardIngestor
ingestors.email.vcard.VCardIngestor
File types
- text/vcard
- text/x-vcard
File extensions
- .vcf
- .vcard
Bases: Ingestor, EncodingSupport
Source code in ingestors/email/vcard.py
VideoIngestor
ingestors.media.video.VideoIngestor
File types
- application/x-shockwave-flash
- video/quicktime
- video/mp4
- video/x-flv
File extensions
- .avi
- .mpg
- .mpeg
- .mkv
- .mp4
- .mov
Bases: Ingestor, TimestampSupport, TranscriptionSupport
Source code in ingestors/media/video.py
ExcelIngestor
ingestors.tabular.xls.ExcelIngestor
File types
- application/excel
- application/x-excel
- application/vnd.ms-excel
- application/x-msexcel
File extensions
- .xls
- .xlt
- .xla
Bases: Ingestor, TableSupport, OLESupport
Source code in ingestors/tabular/xls.py
ExcelXMLIngestor
ingestors.tabular.xlsx.ExcelXMLIngestor
File types
- application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
- application/vnd.openxmlformats-officedocument.spreadsheetml.template
- application/vnd.ms-excel.sheet.macroenabled.12
- application/vnd.ms-excel.sheet.binary.macroenabled.12
- application/vnd.ms-excel.template.macroenabled.12
- application/vnd.ms-excel.sheet.macroEnabled.main+xml
File extensions
- .xlsx
- .xlsm
- .xltx
- .xltm
Bases: Ingestor, TableSupport, OOXMLSupport
Source code in ingestors/tabular/xlsx.py
XMLIngestor
ingestors.documents.xml.XMLIngestor
XML file ingestor class. Generates a tabular HTML representation.
File types
- text/xml
File extensions
- .xml
Bases: Ingestor, EncodingSupport, XMLSupport, HTMLSupport
Source code in ingestors/documents/xml.py
ingest(file_path, entity)
Ingestor implementation.
Source code in ingestors/documents/xml.py
ZipIngestor
ingestors.packages.zip.ZipIngestor
File types
- application/zip
- application/x-zip
- multipart/x-zip
- application/zip-compressed
- application/x-zip-compressed
File extensions
- .zip
Bases: PackageSupport, Ingestor