
Crawl

Crawl a local or remote location of documents (one that supports file listing) into an ftm-datalake dataset. This operation stores the file metadata and the actual file blobs in the configured archive.

This will create a new dataset or update an existing one. Incremental crawls are cached via the global ftm-datalake cache.

Crawls can add files to a dataset but never delete them: files that no longer exist at the source remain in the dataset.

Basic usage

Crawl a local directory

ftm-datalake -d my_dataset crawl /data/dump1/

Crawl an HTTP location

The location needs to support file listing.

In this example, archives (zip, tar.gz, ...) will be extracted during import.

ftm-datalake -d ddos_blueleaks crawl --extract https://data.ddosecrets.com/BlueLeaks/

Crawl from a cloud bucket

In this example, only pdf files are crawled:

ftm-datalake -d my_dataset crawl --include "*.pdf" s3://my_bucket/files

Under the hood, ftm-datalake uses anystore, which builds on fsspec and therefore supports a wide range of filesystem-like sources. For some of them, additional dependencies might need to be installed.
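
As a rough illustration of what such a filesystem-like source looks like underneath, the following sketch lists files via fsspec directly. It assumes the optional s3fs dependency is installed; the bucket name is illustrative.

# Sketch: listing a remote location with fsspec directly.
# Assumes the optional s3fs dependency is installed; the bucket name is illustrative.
import fsspec

fs = fsspec.filesystem("s3", anon=True)  # anonymous access; pass credentials otherwise
for path in fs.glob("my_bucket/files/*.pdf"):
    print(path)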

Extract

Source files can be extracted during import using patool. This has a few caveats:

  • When enabling --extract, the archives themselves are not stored, only their extracted members, which keep the original (archived) directory structure.
  • This can lead to file conflicts if several archives share the same directory structure (file.pdf from archive2.zip would overwrite the one from archive1.zip):
archive1.zip
    subdir1/file.pdf

archive2.zip
    subdir1/file.pdf
  • To avoid this, use --extract-ensure-subdir to create a sub-directory named after the source archive into which the extracted members are placed (see the example command after this list). The result would look like:
archive1.zip/subdir1/file.pdf
archive2.zip/subdir1/file.pdf
  • If keeping the source archives is desired, use --extract-keep-source.
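
For example, to extract archives during import while keeping members of different archives in separate sub-directories (the dataset name and path are illustrative):

ftm-datalake -d my_dataset crawl --extract --extract-ensure-subdir /data/archives/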

Include / Exclude glob patterns

Only crawl a subdirectory:

--include "subdir/*"

Exclude .txt files from a subdirectory and all its children:

--exclude "subdir/**/*.txt"

Reference

Crawl document collections from publicly accessible archives (or local folders)

crawl(uri, dataset, skip_existing=True, write_documents_db=True, exclude=None, include=None, origin=ORIGIN_ORIGINAL, source_file=None)

Crawl a local or remote location of documents into a ftm_datalake dataset.

Parameters:

Name                 Type             Description                                                       Default
uri                  Uri              local or remote location uri that supports file listing          required
dataset              DatasetArchive   ftm_datalake Dataset instance                                     required
skip_existing        bool | None      Don't re-crawl existing keys (doesn't check for checksum)         True
write_documents_db   bool | None      Create csv-based document tables at the end of crawl run          True
exclude              str | None       Exclude glob for file paths not to crawl                          None
include              str | None       Include glob for file paths to crawl                              None
origin               Origins | None   Origin of files (used for sub runs of crawl within a crawl job)   ORIGIN_ORIGINAL
source_file          File | None      Source file (used for sub runs of crawl within a crawl job)       None
Source code in ftm_datalake/crawl.py
def crawl(
    uri: Uri,
    dataset: DatasetArchive,
    skip_existing: bool | None = True,
    write_documents_db: bool | None = True,
    exclude: str | None = None,
    include: str | None = None,
    origin: Origins | None = ORIGIN_ORIGINAL,
    source_file: File | None = None,
) -> CrawlStatus:
    """
    Crawl a local or remote location of documents into a ftm_datalake dataset.

    Args:
        uri: local or remote location uri that supports file listing
        dataset: ftm_datalake Dataset instance
        skip_existing: Don't re-crawl existing keys (doesn't check for checksum)
        write_documents_db: Create csv-based document tables at the end of crawl run
        exclude: Exclude glob for file paths not to crawl
        include: Include glob for file paths to crawl
        origin: Origin of files (used for sub runs of crawl within a crawl job)
        source_file: Source file (used for sub runs of crawl within a crawl job)
    """
    remote_store = get_store(uri=uri)
    # FIXME ensure long timeouts
    if remote_store.scheme.startswith("http"):
        backend_config = ensure_dict(remote_store.backend_config)
        backend_config["client_kwargs"] = {
            **ensure_dict(backend_config.get("client_kwargs")),
            "timeout": aiohttp.ClientTimeout(total=3600 * 24),
        }
        remote_store.backend_config = backend_config
    worker = CrawlWorker(
        remote_store,
        dataset=dataset,
        skip_existing=skip_existing,
        write_documents_db=write_documents_db,
        exclude=exclude,
        include=include,
        origin=origin,
        source_file=source_file,
    )
    return worker.run()
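
A minimal sketch of calling this function from Python, assuming the import path suggested by the source file above and an already configured DatasetArchive instance (how the dataset instance is obtained is not shown here):

from ftm_datalake.crawl import crawl

# `dataset` is assumed to be an existing, configured DatasetArchive instance
status = crawl(
    "s3://my_bucket/files",  # any uri that supports file listing
    dataset,
    include="*.pdf",         # only crawl pdf files
    skip_existing=True,      # don't re-crawl keys that already exist
)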