
Crawl

Crawl a local or remote location of documents (one that supports file listing) into an ftm-datalake dataset. This operation stores the file metadata and the actual file blobs in the configured archive.

This will create a new dataset or update an existing one. Incremental crawls are cached via the global ftm-datalake cache.

Crawls can add files to a dataset but never delete them: files that no longer exist at the source remain in the dataset.

Basic usage

Crawl a local directory

ftm-datalake -d my_dataset crawl /data/dump1/

Crawl an HTTP location

The location needs to support file listing.

In this example, archives (zip, tar.gz, ...) will be extracted during import.

ftm-datalake -d ddos_blueleaks crawl --extract https://data.ddosecrets.com/BlueLeaks/

Crawl from a cloud bucket

In this example, only pdf files are crawled:

ftm-datalake -d my_dataset crawl --include "*.pdf" s3://my_bucket/files

Under the hood, ftm-datalake uses anystore, which builds on fsspec and therefore supports a wide range of filesystem-like sources. For some of them, additional dependencies might need to be installed.
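
As a rough illustration of what such a filesystem-like source looks like underneath, the following sketch lists files via fsspec directly. It assumes the optional s3fs dependency is installed; the bucket name is illustrative.

# Sketch: listing a remote location with fsspec directly.
# Assumes the optional s3fs dependency is installed; the bucket name is illustrative.
import fsspec

fs = fsspec.filesystem("s3", anon=True)  # anonymous access; pass credentials otherwise
for path in fs.glob("my_bucket/files/*.pdf"):
    print(path)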

Extract

Source files can be extracted during import using patool. This has a few caveats:

  • When enabling --extract, the archives themselves are not stored, only their extracted members, which keep the original (archived) directory structure.
  • This can lead to file conflicts if several archives share the same directory structure (file.pdf from archive2.zip would overwrite the one from archive1.zip):
archive1.zip
    subdir1/file.pdf

archive2.zip
    subdir1/file.pdf
  • To avoid this, use --extract-ensure-subdir to create a sub-directory named after the source archive into which the extracted members are placed (see the example command after this list). The result would look like:
archive1.zip/subdir1/file.pdf
archive2.zip/subdir1/file.pdf
  • If keeping the source archives is desired, use --extract-keep-source.
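
For example, to extract archives during import while keeping members of different archives in separate sub-directories (the dataset name and path are illustrative):

ftm-datalake -d my_dataset crawl --extract --extract-ensure-subdir /data/archives/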

Include / Exclude glob patterns

Only crawl a subdirectory:

--include "subdir/*"

Exclude .txt files from a subdirectory and all its children:

--exclude "subdir/**/*.txt"

Reference

Crawl document collections from publicly accessible archives (or local folders)

crawl(uri, dataset, skip_existing=True, write_documents_db=True, exclude=None, include=None, origin=ORIGIN_ORIGINAL, source_file=None)

Crawl a local or remote location of documents into a ftm_datalake dataset.

Parameters:

Name                 Type             Description                                                       Default
uri                  Uri              local or remote location uri that supports file listing          required
dataset              DatasetArchive   ftm_datalake Dataset instance                                     required
skip_existing        bool | None      Don't re-crawl existing keys (doesn't check for checksum)         True
write_documents_db   bool | None      Create csv-based document tables at the end of crawl run          True
exclude              str | None       Exclude glob for file paths not to crawl                          None
include              str | None       Include glob for file paths to crawl                              None
origin               Origins | None   Origin of files (used for sub runs of crawl within a crawl job)   ORIGIN_ORIGINAL
source_file          File | None      Source file (used for sub runs of crawl within a crawl job)       None
Source code in ftm_datalake/crawl.py
def crawl(
    uri: Uri,
    dataset: DatasetArchive,
    skip_existing: bool | None = True,
    write_documents_db: bool | None = True,
    exclude: str | None = None,
    include: str | None = None,
    origin: Origins | None = ORIGIN_ORIGINAL,
    source_file: File | None = None,
) -> CrawlStatus:
    """
    Crawl a local or remote location of documents into a ftm_datalake dataset.

    Args:
        uri: local or remote location uri that supports file listing
        dataset: ftm_datalake Dataset instance
        skip_existing: Don't re-crawl existing keys (doesn't check for checksum)
        write_documents_db: Create csv-based document tables at the end of crawl run
        exclude: Exclude glob for file paths not to crawl
        include: Include glob for file paths to crawl
        origin: Origin of files (used for sub runs of crawl within a crawl job)
        source_file: Source file (used for sub runs of crawl within a crawl job)
    """
    remote_store = get_store(uri=uri)
    # FIXME ensure long timeouts
    if remote_store.scheme.startswith("http"):
        backend_config = ensure_dict(remote_store.backend_config)
        backend_config["client_kwargs"] = {
            **ensure_dict(backend_config.get("client_kwargs")),
            "timeout": aiohttp.ClientTimeout(total=3600 * 24),
        }
        remote_store.backend_config = backend_config
    worker = CrawlWorker(
        remote_store,
        dataset=dataset,
        skip_existing=skip_existing,
        write_documents_db=write_documents_db,
        exclude=exclude,
        include=include,
        origin=origin,
        source_file=source_file,
    )
    return worker.run()
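
A minimal sketch of calling this function from Python, assuming the import path suggested by the source file above and an already configured DatasetArchive instance (how the dataset instance is obtained is not shown here):

from ftm_datalake.crawl import crawl

# `dataset` is assumed to be an existing, configured DatasetArchive instance
status = crawl(
    "s3://my_bucket/files",  # any uri that supports file listing
    dataset,
    include="*.pdf",         # only crawl pdf files
    skip_existing=True,      # don't re-crawl keys that already exist
)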