# Quickstart

## Installation

Requires Python 3.11 or later.
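The install command itself is not shown on this page; assuming the package is published under the same name as its CLI (an assumption worth verifying on PyPI), installation would typically be:

```bash
# Assumed package name, matching the `ftm-lakehouse` CLI used below
pip install ftm-lakehouse
```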
## Basic Concepts

ftm-lakehouse organizes data into datasets. Each dataset contains:

- **Entities**: Structured FollowTheMoney data
- **Archive**: Source documents and files
## Using the Python API

### Get a Dataset

```python
from ftm_lakehouse import get_dataset, ensure_dataset

# Get existing dataset
dataset = get_dataset("my_dataset")

# Or create if it doesn't exist
dataset = ensure_dataset("my_dataset", title="My Dataset")
```
### Working with Entities

```python
from ftm_lakehouse import ensure_dataset
from followthemoney import model

dataset = ensure_dataset("my_dataset")

# Create an entity
person = model.make_entity("Person")
person.make_id("jane-doe")
person.add("name", "Jane Doe")
person.add("nationality", "us")

# Write the entity
dataset.entities.add(person, origin="manual")

# Flush to storage
dataset.entities.flush()

# Read it back
entity = dataset.entities.get(person.id)
print(f"Found: {entity.caption}")
```
### Working with Files

```python
from ftm_lakehouse import ensure_dataset

dataset = ensure_dataset("my_dataset")

# Archive a file
file = dataset.archive.put("/path/to/document.pdf")
print(f"Archived: {file.checksum}")

# Retrieve it
with dataset.archive.open(file) as fh:
    content = fh.read()
```
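The `checksum` above suggests files in the archive are addressed by a hash of their contents. As a rough illustration only (the digest algorithm is an assumption, not confirmed by this page), computing such a checksum locally looks like:

```python
import hashlib


def file_checksum(path: str) -> str:
    """Hash a file's bytes in fixed-size chunks so large files never
    need to fit in memory. (Illustrative sketch using SHA-256; the
    algorithm ftm-lakehouse actually uses may differ.)"""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Content addressing is why the CLI below retrieves files by `<checksum>` rather than by name: identical bytes always map to the same identifier.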
### Bulk Operations

For large imports, use bulk writers:

```python
from ftm_lakehouse import ensure_dataset

dataset = ensure_dataset("my_dataset")

# Write many entities efficiently
with dataset.entities.writer(origin="bulk_import") as writer:
    for entity in large_entity_source():
        writer.add_entity(entity)

# Flush to parquet store
dataset.entities.flush()
```
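`large_entity_source()` in the snippet above is a placeholder: any iterable of entities works, and for large imports a lazy generator keeps memory flat. A hypothetical version reading newline-delimited JSON records (the file name and record shape are assumptions for illustration) might look like:

```python
import json
from typing import Iterator


def large_entity_source(path: str = "entities.ftm.json") -> Iterator[dict]:
    """Yield one record per non-empty line of a newline-delimited JSON
    file, so the whole file is never loaded at once. (Illustrative; a
    real pipeline would construct proper FollowTheMoney entities.)"""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Because the generator yields lazily, the bulk writer above can consume millions of records without the source ever being materialized as a list.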
### Query Entities

```python
# Query with filters
for entity in dataset.entities.query(origin="import"):
    print(entity.caption)

# Stream from exported JSON
for entity in dataset.entities.stream():
    print(entity.caption)
```
## Using the CLI

### Create a Dataset
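The command for this step was not preserved on this page. As a hedged sketch only: given that every subcommand below takes `-d my_dataset`, creation may happen implicitly on first use, or via the `make` subcommand shown under Export Data. The exact invocation is an assumption; verify with `ftm-lakehouse --help`.

```bash
# Hypothetical: materialize the dataset and its metadata
# (subcommand is an assumption, not confirmed by this page)
ftm-lakehouse -d my_dataset make
```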
### Crawl Documents

```bash
# Crawl from a local directory
ftm-lakehouse -d my_dataset crawl /path/to/documents

# Crawl from HTTP source
ftm-lakehouse -d my_dataset crawl https://example.com/files/
```
### Import Entities
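The original command for this step is missing here. A hedged guess by analogy with the other subcommands (the subcommand name and input format are assumptions, not confirmed by this page):

```bash
# Hypothetical: load entities from a FollowTheMoney JSON file
# (subcommand is an assumption; check `ftm-lakehouse --help`)
ftm-lakehouse -d my_dataset load-entities entities.ftm.json
```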
### Export Data

```bash
# Generate all exports
ftm-lakehouse -d my_dataset make --exports

# Stream entities
ftm-lakehouse -d my_dataset stream-entities
```
### Work with Archive

```bash
# List archived files
ftm-lakehouse -d my_dataset archive ls

# Get file metadata
ftm-lakehouse -d my_dataset archive head <checksum>

# Retrieve file content
ftm-lakehouse -d my_dataset archive get <checksum> -o output.pdf
```
## Configuration

Set the storage location via the `LAKEHOUSE_URI` environment variable:

```bash
# Local storage
export LAKEHOUSE_URI=./data

# S3 storage
export LAKEHOUSE_URI=s3://my-bucket/lakehouse
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
```
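How the library consumes `LAKEHOUSE_URI` internally is not shown on this page. Purely as an illustration of the pattern (the fallback value is an assumption), resolving such a setting with a local-path default looks like:

```python
import os


def resolve_lakehouse_uri(default: str = "./data") -> str:
    """Read the storage location from LAKEHOUSE_URI, falling back to a
    local directory. (Illustrative sketch; the library's actual default
    and resolution logic may differ.)"""
    return os.environ.get("LAKEHOUSE_URI", default)
```

Because the value is a URI, the same variable can point at a local path or an `s3://` bucket without code changes, as the examples above show.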
Persistent journal storage is recommended for production; see the Configuration guide for details.
## Next Steps
- Usage Guide - Complete API usage guide
- Working with Entities - Deep dive into entity operations
- Working with Files - Learn about the file archive
- CLI Reference - Complete CLI documentation
- Configuration - Advanced configuration options