Quickstart
Install
Requires Python 3.11 or later.
Build a dataset
ftm-datalake stores metadata for files; this metadata in turn refers to the actual source files.
For example, take this public file listing archive: https://data.ddosecrets.com/Patriot%20Front/patriotfront/2021/Organizational%20Documents%20and%20Notes/
Crawl these documents into a dataset:
ftm-datalake -d ddos_patriotfront crawl "https://data.ddosecrets.com/Patriot%20Front/patriotfront/2021/Organizational%20Documents%20and%20Notes"
The metadata and source files are now stored in the archive (./data by default).
Inspect files and archive
All metadata and other information lives in the ddos_patriotfront/.ftm-datalake subdirectory. Files are keyed and accessible by their (relative) path.
Retrieve file metadata:
Retrieve actual file blob:
Show the metadata of all files present in the dataset archive:
Show only the file paths:
Show only the checksums (sha1 by default):
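Independently of the ftm-datalake CLI, a SHA-1 checksum like the ones listed can be reproduced with the Python standard library, for example to verify a retrieved file blob. The following is a minimal sketch; the helper name `sha1_checksum` is hypothetical and not part of ftm-datalake.

```python
import hashlib
from pathlib import Path


def sha1_checksum(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA-1 digest of the file at `path`.

    Hypothetical helper for illustration; not part of ftm-datalake.
    The file is read in chunks so large files do not need to fit in memory.
    """
    digest = hashlib.sha1()
    with path.open("rb") as fh:
        # iter() with a sentinel stops once read() returns empty bytes (EOF).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `sha1_checksum(Path("some/file.pdf"))` returns a 40-character hex string that can be compared against the checksum stored in the metadata.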
Tracking changes
The make command (re-)generates the dataset's metadata.
Delete a file:
Now regenerate:
The output will indicate that 1 file was deleted.
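Conceptually, detecting a deletion amounts to comparing the paths recorded in the metadata with the files that currently exist. The sketch below illustrates that idea only; it is an assumption about the general approach, not ftm-datalake's actual implementation, and `find_deleted` is a hypothetical name.

```python
from pathlib import Path


def find_deleted(recorded_paths: set[str], root: Path) -> set[str]:
    """Return recorded relative paths that no longer exist under `root`.

    Hypothetical illustration; ftm-datalake may track changes differently.
    """
    # Collect every regular file below root, keyed by its relative POSIX path.
    current = {
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file()
    }
    # Anything recorded earlier but missing now counts as deleted.
    return recorded_paths - current
```

Keying files by relative path, as the archive does, is what makes this set difference meaningful across machines and storage locations.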
Configure storage
storage_config:
  uri: s3://my_bucket
  backend_kwargs:
    endpoint_url: https://s3.example.org
    aws_access_key_id: ${AWS_ACCESS_KEY_ID}
    aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
Dataset config.yml
Follows the specification in ftmq.model.Dataset:
name: my_dataset #  also known as "foreign_id"
title: An awesome leak
description: >
  Incidunt eum asperiores impedit. Nobis est dolorem et quam autem quo. Name
  labore sequi maxime qui non voluptatum ducimus voluptas. Exercitationem enim
  similique asperiores quod et quae maiores. Et accusantium accusantium error
  et alias aut omnis eos. Omnis porro sit eum et.
updated_at: 2024-09-25
index_url: https://static.example.org/my_dataset/index.json
# add more metadata
ftm-datalake: # see above