# Crawl

Crawl a local or remote location of documents (one that supports file listing) into an ftm-datalake dataset. This operation stores the file metadata and the actual file blobs in the configured archive. It creates a new dataset or updates an existing one. Incremental crawls are cached via the global ftm-datalake cache.

Crawls can add files to a dataset but never delete them: files that no longer exist at the source remain in the dataset.
## Basic usage

### Crawl a local directory
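A hypothetical invocation (the command layout and the `-d` dataset option are assumptions; check `ftm-datalake --help` for the exact syntax):

```bash
# crawl a local folder into the dataset "my_dataset"
ftm-datalake -d my_dataset crawl ./documents/
```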
### Crawl an HTTP location

The location needs to support file listing. In this example, archives (zip, tar.gz, ...) are extracted during import.
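An illustrative invocation (same assumptions as above), using the `--extract` flag described below:

```bash
# crawl an HTTP file listing and unpack archives while importing
# (the host is a placeholder)
ftm-datalake -d my_dataset crawl --extract https://example.org/files/
```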
### Crawl from a cloud bucket

In this example, only PDF files are crawled:
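An illustrative invocation (bucket name and command layout are placeholders):

```bash
# crawl an s3 bucket, importing only files that match the include glob
ftm-datalake -d my_dataset crawl --include "*.pdf" s3://my_bucket/files/
```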
Under the hood, ftm-datalake uses anystore, which builds on fsspec and therefore supports a wide range of filesystem-like sources. For some of them, additional dependencies might need to be installed.
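For example, fsspec's storage backends ship as separate packages; crawling an s3 bucket typically requires s3fs:

```bash
# s3fs provides the s3:// filesystem implementation for fsspec
pip install s3fs
```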
## Extract

Source files can be extracted during import using patool. This has a few caveats:
- When enabling `--extract`, archives won't be stored, only their extracted members, keeping the original (archived) directory structure.
- This can lead to file conflicts if several archives have the same directory structure (`file.pdf` from `archive2.zip` would replace the previous one), for example:
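  ```
  # both archives contain the same member path (illustrative example):
  archive1.zip: subdir/file.pdf
  archive2.zip: subdir/file.pdf

  # extracted into the dataset, the last archive wins:
  subdir/file.pdf
  ```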
- To avoid this, use `--extract-ensure-subdir` to place the extracted members into a sub-directory named after their source archive. The result would look like:
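  ```
  archive1.zip/subdir/file.pdf
  archive2.zip/subdir/file.pdf
  ```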
- If keeping the source archives is desired, use `--extract-keep-source`.
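A hypothetical invocation combining these flags (command layout as assumed above):

```bash
# extract archives during import, keep the originals,
# and place members in per-archive sub-directories
ftm-datalake -d my_dataset crawl --extract --extract-ensure-subdir --extract-keep-source ./archives/
```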
## Include / Exclude glob patterns

Only crawl a subdirectory:

```
--include "subdir/*"
```

Exclude .txt files from a subdirectory and all its children:

```
--exclude "subdir/**/*.txt"
```
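Both patterns can be combined in one run (illustrative command layout):

```bash
# crawl only "subdir", skipping any .txt files below it
ftm-datalake -d my_dataset crawl --include "subdir/*" --exclude "subdir/**/*.txt" ./documents/
```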
## Reference

Crawl document collections from publicly accessible archives (or local folders).

```python
crawl(uri, dataset, skip_existing=True, write_documents_db=True, exclude=None, include=None, origin=ORIGIN_ORIGINAL, source_file=None)
```

Crawl a local or remote location of documents into an ftm_datalake dataset.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`uri` | `Uri` | Local or remote location URI that supports file listing | _required_ |
`dataset` | `DatasetArchive` | ftm_datalake Dataset instance | _required_ |
`skip_existing` | `bool \| None` | Don't re-crawl existing keys (doesn't check the checksum) | `True` |
`write_documents_db` | `bool \| None` | Create csv-based document tables at the end of the crawl run | `True` |
`exclude` | `str \| None` | Exclude glob for file paths not to crawl | `None` |
`include` | `str \| None` | Include glob for file paths to crawl | `None` |
`origin` | `Origins \| None` | Origin of files (used for sub-runs of crawl within a crawl job) | `ORIGIN_ORIGINAL` |
`source_file` | `File \| None` | Source file (used for sub-runs of crawl within a crawl job) | `None` |
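A minimal sketch of calling `crawl` from Python. The import paths and the way a `DatasetArchive` is obtained are assumptions; check the ftm_datalake package for its actual entry points:

```python
# illustrative only: the import locations below are assumptions
from ftm_datalake import crawl  # hypothetical import path
from ftm_datalake import DatasetArchive  # hypothetical import path

# obtain the dataset to crawl into (constructor signature is assumed)
dataset = DatasetArchive("my_dataset")

# crawl a bucket, importing only PDF files; keys seen in an
# earlier run are skipped via skip_existing
crawl(
    "s3://my_bucket/documents",
    dataset,
    skip_existing=True,
    include="*.pdf",
)
```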